1. Introduction

1.1 Workshop Objectives

In this workshop, we will introduce you to spatial data analysis and visualization using R Studio. We will do so through a substantive exploration of systemic racism in traffic police stops. In particular, we will take a large (over 3,000,000 observations) dataset of Colorado traffic stops released by the Stanford Open Policing Project, and develop a simple county-level measure of the magnitude of anti-Black racial bias in traffic police stops during the year 2010. We will then display county-level variation in this bias index on a map of Colorado counties.

The workshop has several learning objectives. Among other things, you will learn how to:

  • Read in datasets (both tabular and spatial) into R Studio
  • Clean and process data in R Studio using tidyverse functions (i.e. subset data, reshape data, summarize datasets, join different datasets together, and define new variables)
  • Make static maps and interactive web maps in R Studio using the tmap package
  • Export your maps from R Studio so they can be embedded in your reports and websites

Most broadly, and most importantly, you will gain an appreciation for how to use real-world social scientific data to better understand important social problems, and use this understanding to reflect on possible ways to address these problems.

1.2 Data

The workshop makes use of the following three datasets:

  • Colorado State Patrol traffic stops data, available through the Stanford Open Policing Project (mentioned above). The data was collected by Pierson et al. (2020). The data is available at the project website here, under the “CO” heading; if you’re downloading the data straight from the website, be sure to use the “State Patrol” data.
  • A dataset of county-level demographic data (based on the 2010 decennial census), which your instructors generated prior to the workshop. See the Appendix to this tutorial for details on how this census data was extracted.
  • A spatial dataset of Colorado counties, available from the US census via the Colorado GeoLibrary, a curated online archive of Colorado-related spatial data. The specific dataset we will use is available here. Note that this spatial dataset is stored as a shapefile, which is a common file format used to store spatial data.

For convenience, all of this data can be downloaded from this folder.

1.3 R and R Studio Installation

In order to follow along with the lesson, you must install both R and R Studio. Please see the installation instructions at this page, from Data Carpentry.

1.4 The R Studio Interface

If you haven’t used R Studio before, you can get a quick tour of the R Studio interface and get oriented by reading the brief Introduction to RStudio section here.

1.5 Installing and loading packages

In order to carry out the analysis in the workshop, you must install the tidyverse (a suite of R packages that facilitate data science and analysis), sf (a package used to load and process spatial datasets in R), and tmap (a package that allows one to create maps in R). If you haven’t already installed one or more of these packages, you can do so by placing the package name in quotes as an argument to the install.packages function.

For example, if you wanted to install tmap, you would type the following in your R Studio console, or run it from a script (which you can open by clicking File, selecting New File, and then clicking R Script):

install.packages("tmap")

After installing the required packages, you must load them into memory, which you can do with the following:

# Loads libraries
library(tidyverse)
library(tmap)
library(sf)

Note that you only need to install packages once on a given computer, and (usually) do not need to reinstall packages after quitting an R session and opening up a new one. However, you do need to reload any necessary libraries each time you open a new R session.
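If you would like your scripts to handle installation automatically, one common pattern (shown here as a sketch) is to install a package only when it is not already present, using the requireNamespace function:

```r
# Sketch: install each required package only if it is missing,
# then load it into memory for the current session
pkgs <- c("tidyverse", "sf", "tmap")
for (pkg in pkgs) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  library(pkg, character.only = TRUE)
}
```

This way, rerunning the script on a computer that already has the packages skips the (slow) installation step and simply loads the libraries.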

If you would like additional information about how R packages work and the installation process, please see the Installing Packages section in this Data Carpentry tutorial: https://datacarpentry.org/r-intro-geospatial/01-rstudio-intro/index.html.

1.6 Setting your working directory

Before we can bring our data into R Studio and begin the tutorial, we have to specify the location of the relevant data on our computer. This step is known as setting one’s working directory. Before setting the working directory, make sure that you’ve downloaded the data provided at the folder mentioned above to a directory on your computer.

If you’re unfamiliar with the concept of file paths, the easiest way to set your working directory is through the R Studio menu. To do so, follow these steps:

  • First, click on the “Session” menu on the R Studio menu bar at the top of your screen, and then scroll down to the “Set Working Directory” button in the menu that opens up.
  • When you hover over the “Set Working Directory” button, a subsidiary menu that contains a button that says “Choose Directory” will open; click this “Choose Directory” button.
  • In the dialog box that opens up, navigate to the directory that contains the downloaded workshop data, select it, and click “Open”. At this point, your working directory should be set!

Alternatively, if you are familiar with the concept of file paths, and know the file path to the folder containing the downloaded datasets, you can set the working directory using the setwd() function, where the argument to the function is the relevant file path enclosed in quotation marks. For example:

# Setting the working directory programmatically
setwd("<FILE PATH TO DATA DIRECTORY HERE>")
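Whichever approach you use, you can confirm that the working directory has been set correctly by calling the getwd() function, which prints the current working directory to the console:

```r
# Prints the current working directory, so you can confirm it points
# to the folder containing the workshop data
getwd()
```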

At this point, we’re ready to begin the main part of our lesson!

2. Loading data

Let’s begin by reading in the dataset on State Patrol traffic stops released by the Stanford Open Policing Project. The data can be downloaded from this website, but has also been made available to you as part of the workshop materials as a CSV file.

Once the traffic patrol data has been downloaded into your working directory, pass the name of the file to the read_csv function, and assign it to an object. Here, we’ll assign the traffic patrol data to an object named co_traffic_stops. Note that the name of an object is arbitrary, but ideally should meaningfully describe the data that has been assigned to it.

# Read in Stanford police data for Colorado and assign to object named 
# "co_traffic_stops"
co_traffic_stops<-read_csv("co_statewide_2020_04_01.csv")
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   .default = col_character(),
##   date = col_date(format = ""),
##   time = col_logical(),
##   subject_age = col_double(),
##   arrest_made = col_logical(),
##   citation_issued = col_logical(),
##   warning_issued = col_logical(),
##   contraband_found = col_logical(),
##   search_conducted = col_logical()
## )
## ℹ Use `spec()` for the full column specifications.

Once the traffic patrol data has been read into R Studio and assigned to an object, we can print the contents of the dataset to the console by typing the name of that object into the R Studio console (note that only the first few records will be printed).

# Print the contents of "co_traffic_stops" (i.e. the CO traffic patrol data) 
# to the console; the first few records of the dataset will print
co_traffic_stops
## # A tibble: 3,112,853 × 20
##    raw_row_number date       time  location county_name subject_age subject_race
##    <chr>          <date>     <lgl> <chr>    <chr>             <dbl> <chr>       
##  1 1947986|19479… 2013-06-19 NA    19, I70… Mesa County          26 hispanic    
##  2 1537576        2012-08-24 NA    254, H2… Jefferson …          NA <NA>        
##  3 1581594        2012-09-23 NA    115, I7… Logan Coun…          52 white       
##  4 1009205        2011-08-25 NA    197, H8… Douglas Co…          32 white       
##  5 1932619        2013-06-08 NA    107, H2… Kiowa Coun…          33 hispanic    
##  6 1179436        2011-12-23 NA    48, 384… Boulder Co…          NA <NA>        
##  7 1326795        2012-04-07 NA    0, R250… Boulder Co…          39 white       
##  8 1786795        2013-03-03 NA    19, E47… Arapahoe C…          44 white       
##  9 1552164        2012-09-02 NA    224, H2… Park County          NA <NA>        
## 10 1004281|10042… 2011-08-21 NA    R2000, … Adams Coun…          32 hispanic    
## # … with 3,112,843 more rows, and 13 more variables: subject_sex <chr>,
## #   officer_id_hash <chr>, officer_sex <chr>, type <chr>, violation <chr>,
## #   arrest_made <lgl>, citation_issued <lgl>, warning_issued <lgl>,
## #   outcome <chr>, contraband_found <lgl>, search_conducted <lgl>,
## #   search_basis <chr>, raw_Ethnicity <chr>

We can also view the co_traffic_stops object (or, for that matter, any dataset in R Studio) within the R Studio data viewer by passing the name of the relevant object to the View function:

# Inspect "co_traffic_stops" in the R Studio data viewer
View(co_traffic_stops)

3. Cleaning and filtering data

To make things tractable (after all, the entire dataset has more than 3,000,000 observations!), let’s focus our attention on Colorado traffic patrol stops from the year 2010; this also has the advantage of allowing us to eventually use 2010 census data in crafting our measure of racial bias in traffic stops. Note that there currently isn’t a separate field which contains the year in which a particular stop took place; rather, there is a “date” field, in which the date is stored in YYYY-MM-DD format. The easiest way to extract observations from the year 2010 is therefore to first extract the YYYY information from the “date” field, and use that information to make a new “Year” field, which only contains the year of a given stop; we can then extract the 2010 observations based on this newly generated “Year” field.

3.1 Create a “Year” field

To create this new “Year” field within co_traffic_stops, we can use the mutate function, which is a tidyverse function that allows us to define new variables within a dataset, and the substr function, which allows us to extract a subset of a given string.

The code below takes the data in the co_traffic_stops object, and then (%>%) creates a new field named “Year” using the mutate function; this field is set equal to the first four characters of the existing “date” field by passing date, 1, 4 to the substr function. It then uses the assignment operator (<-) to assign this change back to the co_traffic_stops object, which permanently updates the dataset with the addition of the new “Year” field:

# Creates "Year" field, that contains the year of a given stop, 
# in "co_traffic_stops"
co_traffic_stops<-co_traffic_stops %>% 
                    mutate(Year=substr(date, 1, 4))

Let’s check the updated co_traffic_stops object and make sure that the new field has been successfully created:

# prints contents of "co_traffic_stops"
co_traffic_stops
## # A tibble: 3,112,853 × 21
##    county_name      date       Year  raw_row_number   time  location subject_age
##    <chr>            <date>     <chr> <chr>            <lgl> <chr>          <dbl>
##  1 Mesa County      2013-06-19 2013  1947986|1947987  NA    19, I70…          26
##  2 Jefferson County 2012-08-24 2012  1537576          NA    254, H2…          NA
##  3 Logan County     2012-09-23 2012  1581594          NA    115, I7…          52
##  4 Douglas County   2011-08-25 2011  1009205          NA    197, H8…          32
##  5 Kiowa County     2013-06-08 2013  1932619          NA    107, H2…          33
##  6 Boulder County   2011-12-23 2011  1179436          NA    48, 384…          NA
##  7 Boulder County   2012-04-07 2012  1326795          NA    0, R250…          39
##  8 Arapahoe County  2013-03-03 2013  1786795          NA    19, E47…          44
##  9 Park County      2012-09-02 2012  1552164          NA    224, H2…          NA
## 10 Adams County     2011-08-21 2011  1004281|1004282… NA    R2000, …          32
## # … with 3,112,843 more rows, and 14 more variables: subject_race <chr>,
## #   subject_sex <chr>, officer_id_hash <chr>, officer_sex <chr>, type <chr>,
## #   violation <chr>, arrest_made <lgl>, citation_issued <lgl>,
## #   warning_issued <lgl>, outcome <chr>, contraband_found <lgl>,
## #   search_conducted <lgl>, search_basis <chr>, raw_Ethnicity <chr>

Note the newly created Year field above. We can also check to make sure that the “Year” field has been successfully created by viewing co_traffic_stops in the R Studio data viewer with View(co_traffic_stops).
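As an aside, if you have the lubridate package available (it is installed with the tidyverse, though depending on your tidyverse version it may need to be loaded separately), an equivalent way to extract the year is lubridate’s year() function. The sketch below simply previews the result without overwriting co_traffic_stops; note that year() returns a numeric year (e.g. 2013) rather than a character string (e.g. "2013"), which works equally well for the filtering step that follows:

```r
# Alternative sketch: extract the year with lubridate's year() function,
# which returns a numeric rather than character year; the result is
# printed but not assigned, so the dataset is left unchanged
library(lubridate)
co_traffic_stops %>% 
  mutate(Year = year(date))
```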

3.2 Filter by year

Now that we have created the “Year” field, we can use it to extract the 2010 observations using the filter function, and assign the new dataset of 2010 stops to a new object named co_traffic_stops_2010:

# Extract 2010 observations and assign to a new object named 
# "co_traffic_stops_2010"
co_traffic_stops_2010<-co_traffic_stops %>% filter(Year==2010)

When we print the contents of the newly created co_traffic_stops_2010 object, note that the observations are now only from 2010.

# Print contents of "co_traffic_stops_2010" object
co_traffic_stops_2010
## # A tibble: 470,284 × 21
##    date       Year  county_name      subject_race raw_row_number  time  location
##    <date>     <chr> <chr>            <chr>        <chr>           <lgl> <chr>   
##  1 2010-04-17 2010  Montezuma County white        188721|188722   NA    2, 989,…
##  2 2010-04-17 2010  Montezuma County white        187958          NA    991, 32 
##  3 2010-04-17 2010  Montezuma County hispanic     188451          NA    9, 280,…
##  4 2010-04-17 2010  Montezuma County white        186989|186990|… NA    3, 277,…
##  5 2010-04-17 2010  Montezuma County white        186997|186998|… NA    3, 277,…
##  6 2010-04-17 2010  Montezuma County white        186993|186994|… NA    3, 277,…
##  7 2010-12-21 2010  Mineral County   <NA>         600865          NA    164.5, …
##  8 2010-12-21 2010  Mineral County   <NA>         600477          NA    163, 29…
##  9 2010-01-20 2010  Pueblo County    hispanic     36625|36626     NA    312, H5…
## 10 2010-01-01 2010  Chaffee County   white        275             NA    127, H2…
## # … with 470,274 more rows, and 14 more variables: subject_age <dbl>,
## #   subject_sex <chr>, officer_id_hash <chr>, officer_sex <chr>, type <chr>,
## #   violation <chr>, arrest_made <lgl>, citation_issued <lgl>,
## #   warning_issued <lgl>, outcome <chr>, contraband_found <lgl>,
## #   search_conducted <lgl>, search_basis <chr>, raw_Ethnicity <chr>
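As an optional sanity check, we can also confirm that co_traffic_stops_2010 contains only 2010 observations by listing the unique values of the “Year” field:

```r
# Optional check: "2010" should be the only value of "Year" that prints
unique(co_traffic_stops_2010$Year)
```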

4. Transforming Data

Now that we have created a new data object that contains information on traffic stops from our year of interest (2010), let’s do a bit of work with the data so that we can find out the total number of traffic stops within each county, and the number of stops within each county in which the person stopped by the traffic police was Black.

4.1 Tabulate county-level count of traffic stops by race

First, let’s find out the racial breakdown of stops for each county, using the data in the “subject_race” field of co_traffic_stops_2010.

Below, we take the co_traffic_stops_2010 object, declare “county_name” as a grouping variable using group_by(county_name), and then count up the number of stops associated with each racial category within each county using count(subject_race). The dataset that results from these operations is assigned to a new object named co_county_summary:

# Compute county-level count of traffic stops by race
co_county_summary<-co_traffic_stops_2010 %>% 
                    group_by(county_name) %>% 
                    count(subject_race) 

Let’s print the first few rows from the dataset and observe its structure:

# Prints contents of "co_county_summary"
co_county_summary
## # A tibble: 439 × 3
## # Groups:   county_name [65]
##    county_name    subject_race               n
##    <chr>          <chr>                  <int>
##  1 Adams County   asian/pacific islander   582
##  2 Adams County   black                   1208
##  3 Adams County   hispanic                8012
##  4 Adams County   other                     36
##  5 Adams County   unknown                  462
##  6 Adams County   white                  20225
##  7 Adams County   <NA>                    3825
##  8 Alamosa County asian/pacific islander    18
##  9 Alamosa County black                     43
## 10 Alamosa County hispanic                1537
## # … with 429 more rows

Note that each row contains information on the number of times a person from a given racial category was stopped by the traffic police in a given county. For example, in Adams County, CO, Asian/Pacific Islander motorists were stopped 582 times, Black motorists were stopped 1,208 times, and white motorists were stopped 20,225 times. The dataset contains information on the racial and ethnic breakdown of police stops for each county in Colorado, in the year 2010.
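As an optional aside (not needed for the rest of the tutorial), because co_county_summary is still grouped by county, you can convert these counts into within-county percentages with a single mutate call; sum(n) is computed separately for each county:

```r
# Optional sketch: express each racial category's stop count as a
# percentage of its county's total stops; because the data is grouped
# by county_name, sum(n) is computed within each county
co_county_summary %>% 
  mutate(pct_of_county_stops = 100 * n / sum(n))
```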

4.2 Reshape the data

Note that co_county_summary is currently a “long” dataset, in which there are multiple rows associated with each county, with each row corresponding to a distinct county/racial category combination, and the n column providing information on the number of stops associated with that county/racial category pairing.

It will be easier to instead work with a “wide” dataset, in which each county is associated with a single row, and each racial category is assigned to its own column. Each cell in this “wide” dataset would correspond to the number of stops associated with a given county (defined by the row) for a given racial category (defined by the column).

To reshape the dataset from its current “long” format into a “wide” format, we can use the pivot_wider function. The code below takes the current co_county_summary data object (in “long” format), and then transforms it into a “wide” format with pivot_wider(names_from=subject_race, values_from=n). The names_from argument specifies the name of the current column which contains the categories which we want to transform into columns (here, subject_race), while the values_from argument specifies the name of the current column which contains the values that will be associated with each county/racial category combination (here, n). Finally, the code below assigns the transformed dataset to a new object named co_county_summary_wide:

# Transforms "co_county_summary" from long format to wide, and assigns 
# the reshaped dataset to a new object named "co_county_summary_wide"
co_county_summary_wide<-co_county_summary %>% 
                        pivot_wider(names_from=subject_race, values_from=n)

Let’s print the contents of co_county_summary_wide to ensure that the data has indeed been transformed into a “wide” format:

# prints contents of "co_county_summary_wide"
co_county_summary_wide
## # A tibble: 65 × 8
## # Groups:   county_name [65]
##    county_name       `asian/pacific i…` black hispanic other unknown white  `NA`
##    <chr>                          <int> <int>    <int> <int>   <int> <int> <int>
##  1 Adams County                     582  1208     8012    36     462 20225  3825
##  2 Alamosa County                    18    43     1537     9      30  2427   414
##  3 Arapahoe County                  540  1819     1862    12     300 11089  1898
##  4 Archuleta County                  17    28      392    71      41  4125   417
##  5 Baca County                       11    61      288    NA       6   971   174
##  6 Bent County                        8    46      314     1       6  1155   278
##  7 Boulder County                   345   192     1050    10     180  9682  1594
##  8 Broomfield County                 32    22      104     3      18   690   226
##  9 Chaffee County                    43    37      361     9      71  4806  1194
## 10 Cheyenne County                   10    38      147     3       2   821    85
## # … with 55 more rows

Indeed, we can see that each row is now associated with a single county, and each column is now associated with a given racial category.

4.3 Calculate total stops for each county in co_county_summary_wide

Now that we have our dataset in “wide” format, let’s create a new column, named “total_stops” that contains information on the total number of traffic stops for each county. We can create this column by calculating the sum total of stops across all of the racial categories for each county, which yields the total number of stops for each county.

The code below takes the co_county_summary_wide object, and then calls the rowwise function, which allows us to make calculations across the rows of a data frame. It then creates a new column, called “total_stops” using the now-familiar mutate function; this “total_stops” column is populated by taking the sum of traffic stops across racial categories for each county. This is accomplished with sum(c_across(where(is.integer)), na.rm=TRUE). This expression can be translated as follows: “for each row in the dataset, calculate the sum across the columns whenever the value in a column is an integer; whenever a cell value is ‘NA’, simply ignore it in the calculation.” We’ll assign the dataset, with the newly added “total_stops” column, back to the same co_county_summary_wide object; this effectively overwrites the contents of the current co_county_summary_wide object (a dataset without the “total_stops” column), with the new dataset (which does have the “total_stops” column).

# Takes the existing "co_county_summary_wide" dataset, and creates a new column 
# called "total_stops" that sums the values across columns for each row; 
# the revised dataset is assigned back to "co_county_summary_wide", 
# which overwrites the object's previous contents with the revised dataset
co_county_summary_wide<-co_county_summary_wide %>% 
                        rowwise() %>% 
                        mutate(total_stops=sum(c_across(where(is.integer)), 
                                               na.rm=TRUE)) 

Let’s print the contents of the co_county_summary_wide object and confirm that the new column has been successfully created:

# Prints updated contents of "co_county_summary_wide"
co_county_summary_wide
## # A tibble: 65 × 9
## # Rowwise:  county_name
##    county_name   total_stops `asian/pacific…` black hispanic other unknown white
##    <chr>               <int>            <int> <int>    <int> <int>   <int> <int>
##  1 Adams County        34350              582  1208     8012    36     462 20225
##  2 Alamosa Coun…        4478               18    43     1537     9      30  2427
##  3 Arapahoe Cou…       17520              540  1819     1862    12     300 11089
##  4 Archuleta Co…        5091               17    28      392    71      41  4125
##  5 Baca County          1511               11    61      288    NA       6   971
##  6 Bent County          1808                8    46      314     1       6  1155
##  7 Boulder Coun…       13053              345   192     1050    10     180  9682
##  8 Broomfield C…        1095               32    22      104     3      18   690
##  9 Chaffee Coun…        6521               43    37      361     9      71  4806
## 10 Cheyenne Cou…        1106               10    38      147     3       2   821
## # … with 55 more rows, and 1 more variable: `NA` <int>
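As an aside, the same totals can be computed without rowwise(), using the rowSums function together with across(); this vectorized alternative is typically faster on large datasets. Note that it is a substitute for the rowwise() code above, not a follow-up step: once “total_stops” exists, it is itself an integer column and would be swept into the sum.

```r
# Alternative sketch: compute the row totals in vectorized fashion;
# run this *instead of* the rowwise() code above (i.e. before
# "total_stops" has been added to the dataset)
co_county_summary_wide %>% 
  mutate(total_stops = rowSums(across(where(is.integer)), na.rm = TRUE))
```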

4.4. Clean co_county_summary_wide and assign to new object

Let’s clean up the co_county_summary_wide dataset a bit more to make it easier to work with. First, because our interest in this workshop is in exploring the possibility that Black motorists suffer disproportionately high traffic stop rates (relative to their share of the overall adult population), let’s create a new object that only contains the data that is essential for the subsequent analysis: the county’s name (“county_name”), the number of Black motorists that were stopped (“black”), and the total number of stops in the county across all racial categories (“total_stops”).

The code below takes the existing co_county_summary_wide object, and then uses the select function to select the columns we want to keep. It then assigns this selection to a new object named “co_county_black_stops”:

# Selects the "county_name", "black", and "total_stops" columns 
# from the "co_county_summary_wide" object, and assigns the selection 
# to a new object named "co_county_black_stops"

co_county_black_stops<-co_county_summary_wide %>%
                        select(county_name, black, total_stops) 

Let’s open up the newly created co_county_black_stops object to ensure that these changes have been successfully implemented:

# Prints contents of "co_county_black_stops"
co_county_black_stops
## # A tibble: 65 × 3
## # Rowwise:  county_name
##    county_name       black total_stops
##    <chr>             <int>       <int>
##  1 Adams County       1208       34350
##  2 Alamosa County       43        4478
##  3 Arapahoe County    1819       17520
##  4 Archuleta County     28        5091
##  5 Baca County          61        1511
##  6 Bent County          46        1808
##  7 Boulder County      192       13053
##  8 Broomfield County    22        1095
##  9 Chaffee County       37        6521
## 10 Cheyenne County      38        1106
## # … with 55 more rows

As expected, we now have a dataset that contains information on the total number of traffic stops for each Colorado county in 2010 (“total_stops”), and the number of traffic stops that involved Black motorists (“black”).

The name of the column containing information on the number of Black motorists stopped by the traffic patrol is simply “black”, which was inherited from the initial dataset from the Stanford Open Policing Project. However, this column name is somewhat vague, and may cause confusion down the road when we work with census demographic data. Therefore, let’s rename that column to “black_stops” using the rename function, and assign that change back to co_county_black_stops:

# Takes the existing "co_county_black_stops" object and renames the column named 
# "black" to "black_stops"; assigns the modified dataset back to 
# "co_county_black_stops", which overwrites the existing contents of 
# "co_county_black_stops"
co_county_black_stops<-co_county_black_stops %>%
                        rename(black_stops=black)

Let’s now print the modified co_county_black_stops object:

# Prints contents of "co_county_black_stops"
co_county_black_stops
## # A tibble: 65 × 3
## # Rowwise:  county_name
##    county_name       black_stops total_stops
##    <chr>                   <int>       <int>
##  1 Adams County             1208       34350
##  2 Alamosa County             43        4478
##  3 Arapahoe County          1819       17520
##  4 Archuleta County           28        5091
##  5 Baca County                61        1511
##  6 Bent County                46        1808
##  7 Boulder County            192       13053
##  8 Broomfield County          22        1095
##  9 Chaffee County             37        6521
## 10 Cheyenne County            38        1106
## # … with 55 more rows

Note that the name of the column has successfully been changed, and is now more descriptive.

Finally, if we view co_county_black_stops within the R Studio data viewer (View(co_county_black_stops)), we’ll note a small quirk in the dataset. In particular, while Colorado has 64 counties, the dataset has 65 rows; one of the rows is an extra row, with an “NA” (i.e. missing) value in the “county_name” field, and 20 total traffic stops associated with it. Because we’re interested in a county-level analysis, and these stops are not associated with an actual county, we’ll go ahead and delete that row with the following code, which takes the current co_county_black_stops object, and then uses the filter function to keep only those rows for which “county_name” is NOT missing (the !is.na(county_name) expression evaluates to TRUE whenever “county_name” is not “NA”); after effectively deleting the row where “county_name” is “NA”, the code assigns the modified dataset back to co_county_black_stops:

# Takes the "co_county_black_stops" object and removes the row for which 
# "county_name" is "NA"; assigns the modified dataset back to 
# "co_county_black_stops", which overwrites the dataset that is 
# currently assigned to that object
co_county_black_stops<-co_county_black_stops %>% 
                        filter(!is.na(county_name))

Note that co_county_black_stops now contains 64 rows:

# Prints contents of "co_county_black_stops"
co_county_black_stops
## # A tibble: 64 × 3
## # Rowwise:  county_name
##    county_name       black_stops total_stops
##    <chr>                   <int>       <int>
##  1 Adams County             1208       34350
##  2 Alamosa County             43        4478
##  3 Arapahoe County          1819       17520
##  4 Archuleta County           28        5091
##  5 Baca County                61        1511
##  6 Bent County                46        1808
##  7 Boulder County            192       13053
##  8 Broomfield County          22        1095
##  9 Chaffee County             37        6521
## 10 Cheyenne County            38        1106
## # … with 54 more rows

5. Defining an index of racial bias in traffic stops

Let’s briefly take stock of where we are. We started with a massive dataset (over 3,000,000 observations) of traffic patrol stops in the state of Colorado over the course of almost a decade. Over several steps, we have worked our way to a county-level dataset containing information on the total number of traffic stops, and the number of those stops involving a Black motorist, over the course of the year 2010. That dataset looks something like this:

# Prints contents of "co_county_black_stops"
co_county_black_stops
## # A tibble: 64 × 3
## # Rowwise:  county_name
##    county_name       black_stops total_stops
##    <chr>                   <int>       <int>
##  1 Adams County             1208       34350
##  2 Alamosa County             43        4478
##  3 Arapahoe County          1819       17520
##  4 Archuleta County           28        5091
##  5 Baca County                61        1511
##  6 Bent County                46        1808
##  7 Boulder County            192       13053
##  8 Broomfield County          22        1095
##  9 Chaffee County             37        6521
## 10 Cheyenne County            38        1106
## # … with 54 more rows

Now, let’s think about how to use this information to develop a measure of the extent to which traffic police stops may have been driven by racial bias (whether conscious or unconscious) against Black motorists. We might expect that in a discrimination-free world, the share of traffic stops for a given racial group will reflect their share of the overall adult (over-17) population (we’ll consider the over-17 population as our baseline, since that is the demographic eligible to drive). For example, if the Black percentage of a given county’s adult population is 5%, we could form a simple baseline expectation that in a discrimination-free world where “driving while Black” is not effectively criminalized, the percentage of traffic stops involving Black motorists would not exceed 5%. To the extent that the percentage of traffic stops involving Black motorists did exceed 5%, we might assume a statistical pattern consistent with racial discrimination.

Based on this logic, a simple county-level indicator of anti-Black bias in traffic stops would simply be the difference between the percentage of traffic stops in a given county involving Black motorists, and the percentage of the county’s overall population that is Black:

County-Level Traffic Stop Bias Index = (Percentage of County Traffic Stops Involving Black Motorists) - (Percentage of County’s Adult Population that is Black)

As the value of this difference rises above 0, and we see higher positive values for the bias index, we might infer higher degrees of discrimination in traffic stops.
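To make the index concrete, here is a small worked example with invented numbers (these are hypothetical, purely for illustration):

```r
# Hypothetical example: suppose 10% of a county's stops involve Black
# motorists, while Black residents make up 4% of the county's adult population
pct_stops_black <- 10
pct_pop_black <- 4
bias_index <- pct_stops_black - pct_pop_black
bias_index  # 6, i.e. a 6-percentage-point gap consistent with possible bias
```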

Of course, this is a very simple index, and sets aside many complexities (for example, commuting and driving patterns, among other things). However, despite its simplicity, the intuition behind the index is often used in the social science literature on the topic (Stelter et al., 2021); at least to a first approximation, therefore, this index will give us a meaningful way to identify counties in which the behavior of the traffic patrol might merit greater public scrutiny, on account of disproportionately high stop rates for the Black population.

Let’s create this index for Colorado counties in the year 2010. Note that we already have the data to compute the first part of the bias index (i.e. the percentage of traffic stops involving Black motorists); this is contained in co_county_black_stops:

# prints contents of "co_county_black_stops"
co_county_black_stops
## # A tibble: 64 × 3
## # Rowwise:  county_name
##    county_name       black_stops total_stops
##    <chr>                   <int>       <int>
##  1 Adams County             1208       34350
##  2 Alamosa County             43        4478
##  3 Arapahoe County          1819       17520
##  4 Archuleta County           28        5091
##  5 Baca County                61        1511
##  6 Bent County                46        1808
##  7 Boulder County            192       13053
##  8 Broomfield County          22        1095
##  9 Chaffee County             37        6521
## 10 Cheyenne County            38        1106
## # … with 54 more rows

Let’s therefore turn to calculating the second part of the index: the percentage of each county’s adult population that is Black. We’ll carry out this calculation using demographic data from the 2010 decennial census.

5.1 Read in and join 2010 census data to co_county_black_stops

First, let’s read in the census dataset that was provided to you at the start of the workshop (please see the Appendix to see how the data was extracted) using the read_csv function, and assign it to an object named co_counties_census_2010:

# Reads in census data and assigns to object named "co_counties_census_2010"
co_counties_census_2010<-read_csv("co_county_decennial_census.csv")
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   GEOID = col_character(),
##   County = col_character(),
##   total_pop = col_double(),
##   total_black_pop_over17 = col_double(),
##   total_pop_over17 = col_double()
## )

Let’s print the contents of this census dataset:

# Prints contents of "co_counties_census_2010"
co_counties_census_2010
## # A tibble: 64 × 5
##    GEOID County          total_pop total_black_pop_over17 total_pop_over17
##    <chr> <chr>               <dbl>                  <dbl>            <dbl>
##  1 08023 Costilla County      3524                     18             2788
##  2 08025 Crowley County       5823                    556             5034
##  3 08027 Custer County        4255                     37             3525
##  4 08029 Delta County        30952                    139            24101
##  5 08031 Denver County      600158                  45338           471392
##  6 08035 Douglas County     285465                   2447           198453
##  7 08033 Dolores County       2064                      4             1602
##  8 08049 Grand County        14843                     43            11825
##  9 08039 Elbert County       23086                    122            17232
## 10 08041 El Paso County     622263                  27280           459587
## # … with 54 more rows

Note that it contains county-level information on the overall population (i.e. all age groups and all racial categories), the adult (over-17) population (for all racial groups), and the adult (over-17) Black population. We can use the latter two variables to compute the second part of our bias index.

5.2 Join census data to co_county_black_stops

Before doing so, however, let’s join the census dataset we just viewed (co_counties_census_2010) to the dataset containing information on county-level police stops (co_county_black_stops); this will allow us to calculate the bias index using the information from the single dataset that results from this join.

The code below uses the full_join function to join co_counties_census_2010 (the second argument) to co_county_black_stops (the first argument), using the “county_name” column (from co_county_black_stops) and the “County” column (from co_counties_census_2010) as the join fields; it assigns the dataset that results from the join to a new object named co_counties_census_trafficstops:

# Joins "co_counties_census_2010" to "co_county_black_stops" using the
# "county_name" and "County" fields as the join fields; assigns the product 
# of the join to an object named "co_counties_census_trafficstops"

co_counties_census_trafficstops<-
  full_join(co_county_black_stops, co_counties_census_2010,
            by=c("county_name"="County"))

When we open co_counties_census_trafficstops, we should see information from both of the constituent datasets (co_county_black_stops and co_counties_census_2010):

# Prints contents of "co_counties_census_trafficstops"
co_counties_census_trafficstops
## # A tibble: 64 × 7
## # Rowwise:  county_name
##    county_name       black_stops total_stops GEOID total_pop total_black_pop_ov…
##    <chr>                   <int>       <int> <chr>     <dbl>               <dbl>
##  1 Adams County             1208       34350 08001    441603                9396
##  2 Alamosa County             43        4478 08003     15445                 142
##  3 Arapahoe County          1819       17520 08005    572003               40558
##  4 Archuleta County           28        5091 08007     12084                  19
##  5 Baca County                61        1511 08009      3788                  15
##  6 Bent County                46        1808 08011      6499                 486
##  7 Boulder County            192       13053 08013    294567                1961
##  8 Broomfield County          22        1095 08014     55889                 415
##  9 Chaffee County             37        6521 08015     17809                 264
## 10 Cheyenne County            38        1106 08017      1836                   4
## # … with 54 more rows, and 1 more variable: total_pop_over17 <dbl>
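A note of caution: full_join keeps rows from both inputs even when a county name fails to match, which can silently produce rows filled with NA values. One quick way to check for unmatched rows is dplyr’s anti_join function, which returns the rows of its first argument that have no match in its second. Here is a sketch on small toy tibbles (the data below is illustrative, not the workshop datasets):

```r
library(dplyr)

# Toy stand-ins for the stops and census datasets (illustrative values only)
stops <- tibble(county_name = c("Adams County", "Baca County"),
                black_stops = c(10, 5))
census <- tibble(County = c("Adams County", "Baca County"),
                 total_pop_over17 = c(1000, 500))

# Rows of "stops" with no match in "census"; zero rows means a clean join
unmatched <- anti_join(stops, census, by = c("county_name" = "County"))
nrow(unmatched)
## [1] 0
```

Running the same check on co_county_black_stops and co_counties_census_2010 before a join is a good habit whenever the join fields come from different sources.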

5.3 Define the variables that will be used in the bias index

Now that we have all the information in one dataset, let’s calculate the constituent parts of the bias index. The code below takes the current co_counties_census_trafficstops object that we created above, and then uses the mutate function to create two new variables. The first variable, named “black_stop_pct”, is defined as the percentage of traffic stops that involved a Black motorist. The second variable, named “black_pop_pct”, is defined as the percentage of a given county’s adult population that is Black. The dataset that includes these two new variables is assigned back to co_counties_census_trafficstops:

# Takes "co_counties_census_trafficstops" and then creates two new variables; 
# one new variable is named "black_stop_pct" (the Black percentage of 
# traffic stops) and the other is "black_pop_pct" (the Black percentage 
# of the over-17 population); assigns the updated dataset back to
# "co_counties_census_trafficstops"

co_counties_census_trafficstops<-
  co_counties_census_trafficstops %>% 
      mutate(black_stop_pct=((black_stops/total_stops)*100),
             black_pop_pct=((total_black_pop_over17/total_pop_over17)*100))

Having created these two new variables (i.e. the components of the bias index we’ll calculate below) in co_counties_census_trafficstops, let’s view the dataset’s updated contents:

# Prints contents of "co_counties_census_trafficstops" object
co_counties_census_trafficstops
## # A tibble: 64 × 9
## # Rowwise:  county_name
##    county_name       black_stop_pct black_pop_pct black_stops total_stops GEOID
##    <chr>                      <dbl>         <dbl>       <int>       <int> <chr>
##  1 Adams County               3.52          2.98         1208       34350 08001
##  2 Alamosa County             0.960         1.22           43        4478 08003
##  3 Arapahoe County           10.4           9.55         1819       17520 08005
##  4 Archuleta County           0.550         0.196          28        5091 08007
##  5 Baca County                4.04          0.504          61        1511 08009
##  6 Bent County                2.54          9.00           46        1808 08011
##  7 Boulder County             1.47          0.846         192       13053 08013
##  8 Broomfield County          2.01          1.01           22        1095 08014
##  9 Chaffee County             0.567         1.78           37        6521 08015
## 10 Cheyenne County            3.44          0.289          38        1106 08017
## # … with 54 more rows, and 3 more variables: total_pop <dbl>,
## #   total_black_pop_over17 <dbl>, total_pop_over17 <dbl>

5.4 Calculate the bias index

Now that we have the two components of the traffic stop bias index in co_counties_census_trafficstops (“black_stop_pct”, and “black_pop_pct”), all that is left for us to do is create a new variable, which we’ll call “bias_index”, that is calculated by taking the difference between “black_stop_pct” and “black_pop_pct.” The code below creates this new “bias_index” variable in co_counties_census_trafficstops, and assigns the dataset with this new variable back to co_counties_census_trafficstops:

# Creates new variable in "co_counties_census_trafficstops" named "bias_index" 
# that is defined as the difference between the existing "black_stop_pct" and   
# "black_pop_pct" variables
co_counties_census_trafficstops<-
  co_counties_census_trafficstops %>% 
      mutate(bias_index=black_stop_pct-black_pop_pct)

Note that the “bias_index” variable is now in the dataset assigned to the co_counties_census_trafficstops object:

# prints contents of "co_counties_census_trafficstops"
co_counties_census_trafficstops
## # A tibble: 64 × 10
## # Rowwise:  county_name
##    county_name   bias_index black_stop_pct black_pop_pct black_stops total_stops
##    <chr>              <dbl>          <dbl>         <dbl>       <int>       <int>
##  1 Adams County       0.538          3.52          2.98         1208       34350
##  2 Alamosa Coun…     -0.262          0.960         1.22           43        4478
##  3 Arapahoe Cou…      0.832         10.4           9.55         1819       17520
##  4 Archuleta Co…      0.354          0.550         0.196          28        5091
##  5 Baca County        3.53           4.04          0.504          61        1511
##  6 Bent County       -6.45           2.54          9.00           46        1808
##  7 Boulder Coun…      0.625          1.47          0.846         192       13053
##  8 Broomfield C…      1.00           2.01          1.01           22        1095
##  9 Chaffee Coun…     -1.21           0.567         1.78           37        6521
## 10 Cheyenne Cou…      3.15           3.44          0.289          38        1106
## # … with 54 more rows, and 4 more variables: GEOID <chr>, total_pop <dbl>,
## #   total_black_pop_over17 <dbl>, total_pop_over17 <dbl>

We can use this information to calculate an aggregated bias index for the state of Colorado as a whole in the year 2010:

# Calculates the total number of Black traffic stops in Colorado and assigns 
# the value to an object named "colorado_total_black_stops"
colorado_total_black_stops<-sum(co_counties_census_trafficstops$black_stops, 
                                na.rm=T)

# Calculates the total number of traffic stops in Colorado and assigns the 
# value to an object named "colorado_total_stops"
colorado_total_stops<-sum(co_counties_census_trafficstops$total_stops, na.rm=T)

# Calculates the total adult Black population in CO and assigns the value to 
# an object named "colorado_adult_blackpopulation"
colorado_adult_blackpopulation<-
  sum(co_counties_census_trafficstops$total_black_pop_over17, na.rm=T)

# Calculates the total adult (17+) population in CO and assigns the value to 
# an object named "colorado_adult_overall"
colorado_adult_overall<-sum(co_counties_census_trafficstops$total_pop_over17, 
                            na.rm=T)

# Calculates the percentage of CO traffic stops involving Black motorists 
# and assigns the value to an object named "black_stop_pct_CO"
black_stop_pct_CO<-(colorado_total_black_stops/colorado_total_stops)*100

# Calculates the percentage of the Colorado 17+ population that is Black and 
# assigns the value to an object named "black_over17_population_pct_CO"
black_over17_population_pct_CO<-(colorado_adult_blackpopulation/
                                   colorado_adult_overall)*100

# Calculates the bias index for the state of Colorado as a whole, by computing 
# the difference between "black_stop_pct" and 
# "black_over17_population_pct_CO"; assigns the result to an object 
# named "CO_bias_index"
CO_bias_index<-(black_stop_pct_CO)-(black_over17_population_pct_CO)

Now, let’s print the value of CO_bias_index:

# Prints value of "CO_bias_index"
CO_bias_index
## [1] -1.112616

Interestingly, the value of the aggregate state-level bias index is negative; specifically, this result suggests that in 2010, the Black share of traffic stops in Colorado was about 1.11 percentage points less than the Black share of Colorado’s adult population. At first glance, this might suggest that “driving while Black” was not criminalized in Colorado (during the year under consideration), according to our simple bias indicator. However, it might still be the case that there are specific areas in the state where racial bias in traffic stops is a problem, and focusing on an aggregated state-level measure of the bias index obscures possible micro-level variation in patterns of anti-Black bias with respect to traffic stops. Documenting this micro-level variation is an important task, since it would allow us to identify “problem areas” that might be excessively punitive towards Black drivers. The remainder of the lesson attempts to document this micro-geography of systemic bias in traffic stops (as measured by the index we’ve defined).
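As an aside, the step-by-step calculation above can also be written as a single dplyr pipeline using the summarise function. The sketch below demonstrates the pattern on a toy two-county tibble (the numbers are illustrative, not the real data):

```r
library(dplyr)

# Toy two-county dataset (illustrative numbers only)
toy_counties <- tibble(black_stops            = c(100, 50),
                       total_stops            = c(2000, 2000),
                       total_black_pop_over17 = c(500, 1500),
                       total_pop_over17       = c(10000, 10000))

# Aggregates across counties first, then computes the state-level index
toy_counties %>%
  summarise(black_stop_pct = sum(black_stops) / sum(total_stops) * 100,
            black_pop_pct = sum(total_black_pop_over17) /
                            sum(total_pop_over17) * 100,
            bias_index = black_stop_pct - black_pop_pct)
```

To reproduce CO_bias_index on the real co_counties_census_trafficstops dataset, you would first call ungroup() (the dataset is row-wise grouped, and summarise would otherwise operate row by row) and add na.rm=TRUE to each sum call.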

5.5 Compute summary statistics for the bias index

Recall that we have a measure of the bias index for each county in Colorado; a simple way to document the spatial distribution of the bias index would be to compute some basic summary statistics for the “bias_index” variable in co_counties_census_trafficstops, which will give us a sense of how the bias index varies across counties. We can generate these summary statistics using the summary function.

Below, the argument to the summary function, which reads co_counties_census_trafficstops$bias_index, simply specifies that we want summary statistics for the “bias_index” variable that is in the dataset assigned to the co_counties_census_trafficstops data object:

# Produces summary statistics for "bias_index" variable calculated above
summary(co_counties_census_trafficstops$bias_index)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## -9.8430  0.1978  0.4727  0.5278  0.9832  4.2589       2

This table of summary statistics shows that the mean bias index across counties was 0.53, and the median value was 0.47; this clearly suggests that even though the bias index for the state as a whole was negative, there are many counties with a positive bias index. In other words, many Colorado counties appear to have been stopping Black motorists at disproportionately high rates relative to their share of the county population.

Moreover, there is a fairly large range in the bias index across counties. The county with the largest value for “bias_index” had an index of 4.26 (i.e. the share of Black motorists stopped by the police was 4.26 percentage points higher than the share of Black county residents in the over-17 age demographic), while the county with the smallest value had an index of -9.84 (i.e. the share of Black motorists stopped by police was 9.84 percentage points lower than the share of Black county residents in the 17+ age demographic). These basic results suggest that a few outlier parts of the state, where racial bias (as we are measuring it here) does not appear to be a problem, are driving down the state’s overall bias index, despite the fact that in many areas of the state, Black motorists are indeed subject to disproportionately aggressive policing. One way to quickly explore this possibility further is to create a visualization that conveys county-level variation in the bias index. In Section 5.6, we’ll visualize this county-level variation on a simple graph.
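If you want to pull out the specific counties at those extremes, dplyr’s slice_max and slice_min functions return the rows with the largest and smallest values of a variable. Here is a sketch on toy data (illustrative names and values, not the workshop dataset):

```r
library(dplyr)

# Toy county-level index values (illustrative only)
toy_index <- tibble(County = c("Alpha", "Beta", "Gamma"),
                    bias_index = c(4.26, -9.84, 0.47))

slice_max(toy_index, bias_index, n = 1)   # row with the highest bias index
slice_min(toy_index, bias_index, n = 1)   # row with the lowest bias index
```

The same calls against co_counties_census_trafficstops would identify the counties behind the Min. and Max. values in the summary table.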

5.6 Visualize county-level variation in the bias index using ggplot

There are many possible ways to visualize county-level variation in the “bias_index” variable. Here, we’ll make a simple bar graph using ggplot (from the ggplot2 package), a popular visualization tool that is part of the tidyverse suite of packages.

Before making this graph, let’s quickly make a new column in co_counties_census_trafficstops, named “County”, that takes the information in the existing “county_name” field (whose values take the form “[county name] County”) and removes the “ County” suffix. This will result in a cleaner-looking graph, since we can convey that our geographic units are counties in the graph’s main title.

The following code takes the dataset currently assigned to the co_counties_census_trafficstops object, and then uses the mutate function to create a new field named “County”, which is populated by taking the existing “county_name” column and using the str_remove function to remove the part of the “county_name” string that reads “ County”. It then assigns the modified dataset back to the co_counties_census_trafficstops object, which overwrites the previous version of the dataset:

# Creates new variable named "County" that removes "County" string
# from "county_name" variable
co_counties_census_trafficstops<-
  co_counties_census_trafficstops %>% 
     mutate(County=str_remove(county_name, " County"))

Let’s confirm that the new “County” field has been successfully created:

# prints "co_counties_census_trafficstops"
co_counties_census_trafficstops
## # A tibble: 64 × 11
## # Rowwise:  county_name
##    county_name       County  bias_index black_stop_pct black_pop_pct black_stops
##    <chr>             <chr>        <dbl>          <dbl>         <dbl>       <int>
##  1 Adams County      Adams        0.538          3.52          2.98         1208
##  2 Alamosa County    Alamosa     -0.262          0.960         1.22           43
##  3 Arapahoe County   Arapah…      0.832         10.4           9.55         1819
##  4 Archuleta County  Archul…      0.354          0.550         0.196          28
##  5 Baca County       Baca         3.53           4.04          0.504          61
##  6 Bent County       Bent        -6.45           2.54          9.00           46
##  7 Boulder County    Boulder      0.625          1.47          0.846         192
##  8 Broomfield County Broomf…      1.00           2.01          1.01           22
##  9 Chaffee County    Chaffee     -1.21           0.567         1.78           37
## 10 Cheyenne County   Cheyen…      3.15           3.44          0.289          38
## # … with 54 more rows, and 5 more variables: total_stops <int>, GEOID <chr>,
## #   total_pop <dbl>, total_black_pop_over17 <dbl>, total_pop_over17 <dbl>

Now, we’re ready to use ggplot to make our bar graph.

The following code takes co_counties_census_trafficstops, and then:

  • Uses the drop_na function to remove counties for which the “bias_index” value has an “NA” value (there are two such counties), so that they do not show up on the graph.
  • It then enters the ggplot environment by calling the ggplot function.
  • It then uses the geom_col function to indicate that the desired output is a bar graph. Within the geom_col function, the expression that reads aes(x=County, y=bias_index) specifies which variables we want to represent on the graph, and how we want to represent those variables (i.e. which variable do we want on the x-axis, and which on the y-axis?). The instructions used to translate the variables in a tabular dataset into a visual representation are known as an “aesthetic mapping”, which is abbreviated within the grammar of ggplot to aes.
  • After specifying that we want a bar graph based on the co_counties_census_trafficstops object (with the “County” column mapped to the x-axis and the “bias_index” column mapped to the y-axis), we use the labs function (short for “labels”) to designate the graph’s main title, and the labels for the x- and y-axes.
  • In ggplot, the theme function is a versatile function that helps to customize the appearance of a ggplot object; here, we set the axis.title.x argument equal to element_text(size=9), which sets the size of the x-axis title. In addition, we set the plot.title argument within the theme function equal to element_text(hjust=0.5), which center-justifies the plot’s main title. Next, we use the scale_y_continuous function to set the interval breaks of the y-axis; these breaks are defined as a vector passed to the breaks argument, where each numeric element of the vector denotes a desired interval break.
  • The expand_limits function allows us to expand the range of either axis beyond the range established by the axis interval markers; here, by specifying y=c(-10,10) as an argument to the expand_limits function, we effectively stretch the y-axis range from -10 to 10, even though the highest y-axis tick mark is set at 7.5.
  • Finally, we assign the plot that is created with this code to a new object named bias_graph:
# Uses ggplot to create bar graph of "bias_index" variable, with "bias_index" 
# on the y-axis and counties on the x-axis
bias_graph<-
  co_counties_census_trafficstops %>% 
  drop_na(bias_index) %>% 
    ggplot()+
      geom_col(aes(x=County, y=bias_index))+
      labs(title="Racial Discrimination in Police Stops, by County (Colorado)", 
           x="County", 
           y="Bias Index=\nBlack Percentage of Overall Traffic Stops – 
           Black Percentage of Overall Over-17 Population")+
       theme(axis.title.x = element_text(size = 9), 
             plot.title=element_text(hjust=0.5))+      
       scale_y_continuous(breaks=c(-10, -7.5, -5, -2.5, 0, 2.5, 5, 7.5))+
       expand_limits(y=c(-10,10))

Let’s print the contents of bias_graph and see what the plot looks like; it will appear in the “Plots” tab on the bottom-right of your R Studio interface.

# Prints contents of "bias_graph"
bias_graph

As we can see, the graph does show that certain counties have values around zero, or well below zero on the bias index, which drives down the overall value for the state; but in many other counties, the value of the “bias_index” variable is positive and substantively large.

While the graph does convey valuable information, it’s a little bit confusing to look at; for example, the county names are squished together in a way that makes them unreadable.

A quick way to make the plot more readable is to invert the axes (such that counties are on the y-axis, and the bias index is arrayed along the x-axis). The code below is largely the same as the code in the previous code block. The differences are twofold, and together change the graph’s appearance. First, the argument to geom_col looks a bit different. Instead of simply specifying aes(x=County, y=bias_index), we specify aes(x=reorder(County, bias_index), y=bias_index). The reorder function specifies that we want the x-axis variable (“County”) to be sorted by its values on “bias_index”; sorting the bars in this way makes for a graph that is easier to read. Second, we call the coord_flip() function, which inverts the axes such that the x-axis and y-axis (specified as arguments to geom_col) are swapped. We’ll assign this new plot to a new object named bias_graph_inverted:

# Uses ggplot to create bar graph of "bias_index" variable, with "bias_index" 
# on the x-axis and counties on the y-axis; assigned to object named 
# "bias_graph_inverted"
bias_graph_inverted<-
  co_counties_census_trafficstops %>% 
    drop_na(bias_index) %>% 
      ggplot()+
        geom_col(aes(x=reorder(County, bias_index), y=bias_index))+
        coord_flip()+
       labs(
         title="Racial Discrimination in Police Stops, by County (Colorado)", 
         x="County", 
        y="Bias Index=\nBlack Percentage of Overall Traffic Stops-
        Black Percentage of Overall Over-17 Population")+
        theme(axis.title.x = element_text(size = 9), 
              plot.title=element_text(hjust=0.5))+      
        scale_y_continuous(breaks=c(-10, -7.5, -5, -2.5, 0, 2.5, 5, 7.5))+
        expand_limits(y=c(-10,10))

Let’s view the revised plot:

# prints contents of "bias_graph_inverted"
bias_graph_inverted

We can see that those relatively small changes produced a more readable and informative plot. We can get a clear sense of how the traffic stops bias index varies across Colorado counties, and identify the counties where policing practices appear excessively punitive towards Black drivers. The plot clearly conveys why focusing solely on aggregated statistics might be problematic: such measures have the potential to obscure more granular patterns of bias within jurisdictions.
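If you’d like to save a plot like this for a report, ggplot2’s ggsave function writes a plot object to an image file. Below is a self-contained sketch using a tiny example plot; the file name and dimensions are arbitrary choices, and to export the workshop graph you would pass bias_graph_inverted instead:

```r
library(ggplot2)

# A tiny example plot built from illustrative data
example_plot <- ggplot(data.frame(county = c("A", "B"), index = c(1.5, -2))) +
  geom_col(aes(x = county, y = index))

# Writes the plot to disk; the file name and size are arbitrary examples
ggsave("example_plot.png", plot = example_plot,
       width = 6, height = 4, units = "in")
```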

6. Mapping the bias index

However, while the plot above gives us a sense of counties where the State Patrol practices may deserve greater scrutiny, it is difficult to contextualize that information without a clear sense of where these counties are located. For example, we might want to know whether there are clusters of counties with particularly high or low values for the bias index. And if so, what might explain such patterns?

In short, counties are spatial units, and creating a visualization that explicitly situates those spatial units in their geographic context might prove even more informative than a plot such as the one developed in the previous section (where counties are not situated in their spatial context). In other words, it would be useful to display the bias index on a county-level map of Colorado, which will allow us to get a clearer sense of the granular spatial distribution of the bias index across the state. This section will walk through the process of creating such a map.

6.1 Read in and view the shapefile of CO counties

In order to visualize data on the bias index we created on a map of Colorado counties, we need to first load a spatial dataset of Colorado counties into R Studio. A spatial dataset is simply a dataset that has geographic information attached to it; this geographic information can be used to render the data as a map.

Let’s load in the spatial dataset of Colorado counties that was provided to you at the start of the workshop. In particular, the spatial dataset that was provided to you is stored as a shapefile, which is a commonly used file format for storing spatial datasets. We can load a shapefile into R Studio using the st_read function from the sf package. Below, we’ll load in our shapefile, and assign it to a new object named co_counties_shapefile. Note that a given shapefile consists of several different files with various extensions; make sure that all of these files are in your working directory. Note, too, that the file name passed to the st_read function must have a “.shp” extension, as below:

# Reads in shapefile and assigns to object named "co_counties_shapefile"
co_counties_shapefile<-st_read("tl_2019_08_county.shp")
## Reading layer `tl_2019_08_county' from data source `/Users/adra7980/Documents/CU_workshops/gis/data/tl_2019_08_county/tl_2019_08_county.shp' using driver `ESRI Shapefile'
## Simple feature collection with 64 features and 17 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -109.0602 ymin: 36.99245 xmax: -102.0415 ymax: 41.00344
## geographic CRS: NAD83

Upon reading in the shapefile, you’ll notice that some metadata about the shapefile is printed to the console. We see that the shapefile has a geometry type of “multipolygon” (other possible geometry types are points and lines), that there are 64 features (rows) and 17 attribute fields (columns) in the dataset, and that its coordinate reference system (CRS) is NAD83. Coordinate reference systems are important to consider if you plan to do spatial analysis that involves calculations with spatially defined attributes; since we don’t plan to do analysis of this nature (we’re only interested in using the shapefile as a way to visualize data), we can set this concept aside for the purpose of our workshop.
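Although we’ll set coordinate reference systems aside, it’s worth knowing that the sf package lets you inspect a CRS with st_crs and reproject with st_transform. Here is a self-contained sketch on a toy point geometry (the EPSG codes 4326 and 5070 are just examples):

```r
library(sf)

# A toy point geometry with a geographic CRS (EPSG 4326, i.e. WGS84)
pt <- st_sfc(st_point(c(-105.5, 39.0)), crs = 4326)

# Inspect the EPSG code of the current CRS
st_crs(pt)$epsg
## [1] 4326

# Reproject to a projected CRS (EPSG 5070, a US Albers equal-area
# projection), e.g. if area or distance calculations were ever needed
pt_albers <- st_transform(pt, crs = 5070)
```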

Now, let’s print the contents of co_counties_shapefile:

# Prints contents of "co_counties_shapefile"
co_counties_shapefile
## Simple feature collection with 64 features and 17 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -109.0602 ymin: 36.99245 xmax: -102.0415 ymax: 41.00344
## geographic CRS: NAD83
## First 10 features:
##                          geometry STATEFP COUNTYFP COUNTYNS GEOID     NAME
## 1  MULTIPOLYGON (((-106.8714 3...      08      109 00198170 08109 Saguache
## 2  MULTIPOLYGON (((-102.6521 4...      08      115 00198173 08115 Sedgwick
## 3  MULTIPOLYGON (((-102.5769 3...      08      017 00198124 08017 Cheyenne
## 4  MULTIPOLYGON (((-105.7969 3...      08      027 00198129 08027   Custer
## 5  MULTIPOLYGON (((-108.2952 3...      08      067 00198148 08067 La Plata
## 6  MULTIPOLYGON (((-107.9751 3...      08      111 00198171 08111 San Juan
## 7  MULTIPOLYGON (((-106.9154 3...      08      097 00198164 08097   Pitkin
## 8  MULTIPOLYGON (((-105.9751 3...      08      093 00198162 08093     Park
## 9  MULTIPOLYGON (((-106.0393 3...      08      003 00198117 08003  Alamosa
## 10 MULTIPOLYGON (((-102.2111 3...      08      099 00198165 08099  Prowers
##           NAMELSAD LSAD CLASSFP MTFCC CSAFP CBSAFP METDIVFP FUNCSTAT      ALAND
## 1  Saguache County   06      H1 G4020  <NA>   <NA>     <NA>        A 8206547705
## 2  Sedgwick County   06      H1 G4020  <NA>   <NA>     <NA>        A 1419419016
## 3  Cheyenne County   06      H1 G4020  <NA>   <NA>     <NA>        A 4605713960
## 4    Custer County   06      H1 G4020  <NA>   <NA>     <NA>        A 1913031921
## 5  La Plata County   06      H1 G4020  <NA>  20420     <NA>        A 4376255148
## 6  San Juan County   06      H1 G4020  <NA>   <NA>     <NA>        A 1003660672
## 7    Pitkin County   06      H1 G4020   233  24060     <NA>        A 2514104907
## 8      Park County   06      H1 G4020   216  19740     <NA>        A 5682182508
## 9   Alamosa County   06      H1 G4020  <NA>   <NA>     <NA>        A 1871465874
## 10  Prowers County   06      H1 G4020  <NA>   <NA>     <NA>        A 4243429484
##      AWATER    INTPTLAT     INTPTLON
## 1   4454510 +38.0316514 -106.2346662
## 2   3530746 +40.8715679 -102.3553579
## 3   8166129 +38.8356456 -102.6017914
## 4   3364150 +38.1019955 -105.3735123
## 5  25642578 +37.2873673 -107.8397178
## 6   2035929 +37.7810492 -107.6702567
## 7   6472577 +39.2175376 -106.9161587
## 8  43519840 +39.1189141 -105.7176479
## 9   1847610 +37.5684423 -105.7880414
## 10 15345176 +37.9581814 -102.3921613

Notice that this looks very much like a typical dataset, one that we can also view in R Studio’s data viewer with View(co_counties_shapefile). The key element that makes this dataset different from a conventional dataset is contained in the “geometry” column. For each county, the geometry column contains a series of latitude/longitude pairs corresponding to that county’s borders in the “real world”. These lat/long pairs are processed by GIS software, and used to render a visual representation of a county’s geographic borders that reflects its “real-world” shape and location. When each row is rendered simultaneously, the result is a map of Colorado counties that can be used as a springboard for data visualization and analysis.

Within R Studio, we can leverage the geometry information in a shapefile to visually render its geographic attributes using functions from the tmap package. The code below uses the geometry information in co_counties_shapefile to render the Colorado county polygons as a map. First, we pass the name of the spatial object we’d like to map (co_counties_shapefile) to the tm_shape function, and then use the tm_polygons function to instruct the mapping utility that the spatial data in the “geometry” column is to be rendered as polygons:

# Uses "geometry" information in "co_counties_shapefile" to create map of CO 
# counties
tm_shape(co_counties_shapefile)+
  tm_polygons()

You should see a map that looks like this in the “Plots” tab on the bottom-right of your R Studio interface.

Below, we’ll assign this basic map, which was rendered from co_counties_shapefile, to an object named co_counties_map:

# assigns map of CO polygons to new object named "co_counties_map"
co_counties_map<-tm_shape(co_counties_shapefile)+
                    tm_polygons()

Now, whenever we want to retrieve the map, we can simply print the name of the object to which it has been assigned:

# prints contents of "co_counties_map"
co_counties_map

In using tmap to translate the geographic information in the “geometry” column of our spatial dataset of Colorado counties into a visual representation, we produced a static map; in other words, we cannot do things like pan around the map, or zoom in and out. However, if we want to render a dynamic map where such things are possible, we can simply use the tmap_mode function to change the setting of the tmap environment to “view”, as below:

# Sets tmap mode to "view"
tmap_mode("view")
## tmap mode set to interactive viewing

Now, if we open the co_counties_map object that we created earlier, tmap will render a dynamic and interactive map:

# Prints "co_counties_map" as a web map
co_counties_map

You will be able to view this interactive map in the “Viewer” tab on the bottom right of your R Studio interface. This interactive map is essentially a web map, and can easily be exported as an html file and embedded on a website (if we so choose).
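For instance, one way to export the interactive map is with tmap’s tmap_save function, which infers the output format from the file extension. A minimal sketch (the filename here is illustrative):

```r
# Saves the interactive map as a standalone HTML file; the filename
# is illustrative, and the map should be created in "view" mode
tmap_save(co_counties_map, filename = "co_counties_map.html")
```

The resulting .html file can then be opened in any browser or embedded in a website.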

If we want to switch back to making static maps, we can switch back to “plot” mode using the same tmap_mode function:

# Switches tmap mode back to "plot"
tmap_mode("plot")
## tmap mode set to plotting

Once we’re back in “plot” mode, tmap will render spatial objects as static maps. For instance, if we open co_counties_map again, it will render as a static map:

# prints "co_counties_map" as a static plot
co_counties_map

As we work with spatial data and maps in R Studio using tmap, we can easily toggle back and forth between “view” and “plot” modes, depending on our desired outputs.

6.2 Join co_counties_census_trafficstops to the shapefile of Colorado counties

Now that we’ve loaded and explored our spatial dataset of Colorado counties, let’s turn to the process of displaying the bias index on the map that we just rendered from our spatial dataset of CO counties.

In order to display our data of interest on the map of Colorado counties that we generated above, we must first join the data on the bias index to the spatial dataset; once the bias index data is in our spatial dataset, we can use tmap functions to render a map that displays this data on the county polygons.

We can implement the join between a spatial dataset and a tabular dataset in much the same way as we implemented the join between two tabular datasets earlier in the tutorial (between the traffic stops dataset and the census dataset).
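To see the mechanics of full_join in isolation, here is a toy example with hypothetical data (the county names and values below are made up for illustration; only the shared “GEOID” key matters):

```r
library(dplyr)

# Two toy tables that share a "GEOID" key column (hypothetical values)
counties <- data.frame(GEOID = c("08109", "08115"),
                       NAME  = c("Saguache", "Sedgwick"))
stats    <- data.frame(GEOID = c("08115", "08109"),
                       bias_index = c(3.05, 0.56))

# full_join keeps every row from both tables, matching rows on "GEOID";
# rows with no match in the other table would be kept with NA values
joined <- full_join(counties, stats, by = "GEOID")
joined
```

Because every row here has a match, the result has one row per county, with the columns from both tables side by side.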

Below, the first argument to the full_join function is the name of our spatial object of Colorado counties; the second argument is the name of the object that contains the “bias_index” data (which we want to merge into the shapefile). Finally, the third argument indicates that we want to use the field named “GEOID” (which appears in both datasets) as the join field. We’ll assign the expanded spatial dataset that results from the join to a new object named county_shapefile_biasIndex:

# Joins "co_counties_census_trafficstops" into "co_counties_shapefile" 
# using GEOID as the join field; assigns the new joined dataset to a new 
# object named "county_shapefile_biasIndex"
county_shapefile_biasIndex<-
  full_join(co_counties_shapefile, 
            co_counties_census_trafficstops, by="GEOID")

Now, when we open county_shapefile_biasIndex, we should see the “bias_index” variable in the dataset, along with the “geometry” information needed to render a map of Colorado counties:

# prints contents of "county_shapefile_biasIndex"
county_shapefile_biasIndex
## Simple feature collection with 64 features and 27 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -109.0602 ymin: 36.99245 xmax: -102.0415 ymax: 41.00344
## geographic CRS: NAD83
## First 10 features:
##        NAME                       geometry bias_index STATEFP COUNTYFP COUNTYNS
## 1  Saguache MULTIPOLYGON (((-106.8714 3...  0.5591577      08      109 00198170
## 2  Sedgwick MULTIPOLYGON (((-102.6521 4...  3.0473002      08      115 00198173
## 3  Cheyenne MULTIPOLYGON (((-102.5769 3...  3.1472044      08      017 00198124
## 4    Custer MULTIPOLYGON (((-105.7969 3... -0.2021878      08      027 00198129
## 5  La Plata MULTIPOLYGON (((-108.2952 3...  0.2272495      08      067 00198148
## 6  San Juan MULTIPOLYGON (((-107.9751 3...  0.8000000      08      111 00198171
## 7    Pitkin MULTIPOLYGON (((-106.9154 3...  0.3336883      08      097 00198164
## 8      Park MULTIPOLYGON (((-105.9751 3...  0.3973331      08      093 00198162
## 9   Alamosa MULTIPOLYGON (((-106.0393 3... -0.2620964      08      003 00198117
## 10  Prowers MULTIPOLYGON (((-102.2111 3...  3.2319999      08      099 00198165
##    GEOID        NAMELSAD LSAD CLASSFP MTFCC CSAFP CBSAFP METDIVFP FUNCSTAT
## 1  08109 Saguache County   06      H1 G4020  <NA>   <NA>     <NA>        A
## 2  08115 Sedgwick County   06      H1 G4020  <NA>   <NA>     <NA>        A
## 3  08017 Cheyenne County   06      H1 G4020  <NA>   <NA>     <NA>        A
## 4  08027   Custer County   06      H1 G4020  <NA>   <NA>     <NA>        A
## 5  08067 La Plata County   06      H1 G4020  <NA>  20420     <NA>        A
## 6  08111 San Juan County   06      H1 G4020  <NA>   <NA>     <NA>        A
## 7  08097   Pitkin County   06      H1 G4020   233  24060     <NA>        A
## 8  08093     Park County   06      H1 G4020   216  19740     <NA>        A
## 9  08003  Alamosa County   06      H1 G4020  <NA>   <NA>     <NA>        A
## 10 08099  Prowers County   06      H1 G4020  <NA>   <NA>     <NA>        A
##         ALAND   AWATER    INTPTLAT     INTPTLON     county_name   County
## 1  8206547705  4454510 +38.0316514 -106.2346662 Saguache County Saguache
## 2  1419419016  3530746 +40.8715679 -102.3553579 Sedgwick County Sedgwick
## 3  4605713960  8166129 +38.8356456 -102.6017914 Cheyenne County Cheyenne
## 4  1913031921  3364150 +38.1019955 -105.3735123   Custer County   Custer
## 5  4376255148 25642578 +37.2873673 -107.8397178 La Plata County La Plata
## 6  1003660672  2035929 +37.7810492 -107.6702567 San Juan County San Juan
## 7  2514104907  6472577 +39.2175376 -106.9161587   Pitkin County   Pitkin
## 8  5682182508 43519840 +39.1189141 -105.7176479     Park County     Park
## 9  1871465874  1847610 +37.5684423 -105.7880414  Alamosa County  Alamosa
## 10 4243429484 15345176 +37.9581814 -102.3921613  Prowers County  Prowers
##    black_stop_pct black_pop_pct black_stops total_stops total_pop
## 1       0.7296607     0.1705030          20        2741      6108
## 2       3.4120735     0.3647733          26         762      2379
## 3       3.4358047     0.2886003          38        1106      1836
## 4       0.8474576     1.0496454           1         118      4255
## 5       0.6191950     0.3919455          70       11305     51334
## 6       0.8000000     0.0000000           1         125       699
## 7       0.8213552     0.4876670           4         487     17148
## 8       0.7943403     0.3970072          64        8057     16206
## 9       0.9602501     1.2223466          43        4478     15445
## 10      3.7458295     0.5138297         247        6594     12551
##    total_black_pop_over17 total_pop_over17
## 1                       8             4692
## 2                       7             1919
## 3                       4             1386
## 4                      37             3525
## 5                     160            40822
## 6                       0              571
## 7                      69            14149
## 8                      52            13098
## 9                     142            11617
## 10                     47             9147

6.3 Display the bias index on a map of Colorado counties

At this point, with our “bias_index” variable included in a spatially explicit dataset, we are ready to visualize this data on a map, and observe how our measure of bias in police stops varies geographically across counties.

6.3.1 Make a rough draft of a map

Let’s start by making the simplest possible map of “bias_index”. Recall that earlier, we drew the Colorado county polygons with the following:

# Uses tmap to render Colorado polygons based on geometry information in 
# "county_shapefile_biasIndex" object
tm_shape(county_shapefile_biasIndex)+
  tm_polygons()

Now, to display actual data from the dataset on the Colorado county polygons, we can simply pass an argument to the tm_polygons function. In particular, we can specify the column that contains the data we want to represent on the map, using the col argument to the tm_polygons function. In our case, the name of the column that contains the data we want to display on the map is “bias_index”, so we will specify col="bias_index" within the tm_polygons function:

# Creates map of "bias_index" variable from "county_shapefile_biasIndex" 
# spatial object
tm_shape(county_shapefile_biasIndex)+
  tm_polygons(col="bias_index")
## Variable(s) "bias_index" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.

As you can see, this allowed us to display the “bias_index” data on the map of Colorado counties. Admittedly, this map is very rough; important elements of the map, such as the legend interval breaks and the color scheme, were chosen arbitrarily by tmap, and these arbitrary settings hinder the ability of the map to effectively convey information about the spatial distribution of “bias_index” across Colorado counties.

However, tmap allows us to customize our maps and explicitly specify parameters that shape the map’s appearance. Let’s begin exploring some of these customization possibilities. In the code below, we still start out by passing county_shapefile_biasIndex to the tm_shape function (so as to specify the spatial object we’d like to map), and passing col="bias_index" to the tm_polygons function (as before).

However, we’ll begin customizing the map by passing additional arguments to the tm_polygons function:

  • With respect to the color scheme, we’ll designate a yellow-to-red color palette with palette="YlOrRd", and use midpoint=0 to set 0 as the point in the data distribution that corresponds to the midpoint of the palette (in other words, a hypothetical county where “bias_index” is exactly zero will be displayed in the color palette’s median color). For more information on color options in R (including color and palette codes), see the useful R Color cheatsheet.
  • In the next three arguments to tm_polygons, we’ll begin customizing the map’s legend. Setting textNA="No Data" specifies that the legend should label the category for counties with NA values on the bias index as “No Data”, rather than the default label, which is “Missing”. Setting n=5 establishes that we want to group our data into five distinct intervals, which means our legend will have five breaks. Finally, setting style="jenks" specifies that we want to use the Jenks algorithm to decide where to place those legend breaks (or in other words, how to break up the data into five distinct intervals). For more information on the Jenks Natural Breaks Classification, as well as other data partition algorithms, see here.
  • Finally, we call the tm_layout function, which allows us to customize the map’s layout; within this function, we set frame=FALSE (which removes the map’s frame, or bounding box) and legend.outside=TRUE (which places the legend outside the map’s domain, so as to not interfere with the display of counties).
  tm_shape(county_shapefile_biasIndex)+
      tm_polygons(col="bias_index", 
                  palette="YlOrRd", 
                  midpoint=0,
                  textNA="No Data", 
                  n=5,
                  style="jenks")+
      tm_layout(frame=FALSE, 
               legend.outside=TRUE)

This is starting to look better; in the next two subsections, we’ll explore how to further control the map’s appearance by setting custom breaks and custom color schemes.

6.3.2 Make a map with custom breaks

Let’s say that instead of using the Jenks algorithm (or some other algorithm) to partition our data into intervals, we want to set our own data intervals. Doing so might make sense here, especially since we want to clearly differentiate counties with a bias index less than or equal to zero from those that are above zero. We can specify our legend breaks in a numeric vector (i.e. a sequence of numbers) passed to the breaks argument within the tm_polygons function. Below, we set breaks=c(-10,-5,0,2,4,5), which indicates that we want our intervals to be from -10 to -5; -5 to 0; 0 to 2; 2 to 4; and 4 to 5. Note that the “c” before the sequence of numbers bounded by parentheses is actually a function (short for “combine”), which indicates that the sequence of elements that follows is to be interpreted as a vector. Also, note that we’ve removed the style="jenks" argument that we used above, since we’re using custom legend breaks (rather than breaks implemented by the Jenks algorithm). Other than those changes, the code below is the same as that used in the previous code block.

  tm_shape(county_shapefile_biasIndex)+
    tm_polygons(col="bias_index", 
                palette="YlOrRd", 
                midpoint=0,
                textNA="No Data",
                breaks=c(-10,-5, 0, 2, 4, 5))+
    tm_layout(frame=FALSE, 
              legend.outside=TRUE)

6.3.3 Make a map with custom colors

So far, we’ve been using a predefined color palette (“YlOrRd”) to display the range of values on the map. While this color scheme might be a good start, it can sometimes be difficult to distinguish the colors on the map. One possible fix is to explore other predefined color schemes, and find one that makes colors easier to distinguish given the data we have. Another possibility is to specify our own colors for the intervals we want to map. To do this, let’s first define a character vector, assigned to an object named my_colors, that contains the colors we want to use (once again, a reminder that color names are available on the R Color Cheatsheet):

# defines vector of colors and assigns vector to an object named "my_colors"
my_colors<-c("white", "peachpuff", "red1", "red4", "navy")

Now, we can pass this vector as an argument to the tm_polygons function. Instead of setting palette="YlOrRd" (as above), we set palette=my_colors. The colors in the my_colors vector are assigned to the numeric intervals in order; that is, the interval from -10 to -5 is assigned the color “white” (the first color in the vector), the interval from -5 to 0 is assigned the color “peachpuff” (the second color in the vector), the interval from 0 to 2 is assigned the color “red1” (the third color in the vector), and so on. Everything else in the code remains the same as in the previous section:

tm_shape(county_shapefile_biasIndex)+
  tm_polygons(col="bias_index", 
              palette=my_colors, 
              textNA="No Data",
              breaks=c(-10,-5, 0, 2, 4, 5))+
  tm_layout(frame=FALSE, 
            legend.outside=TRUE)

We can see that these colors are easier to distinguish, which makes it easier to quickly scan the map for relevant patterns.

It is also worth reminding ourselves that it’s possible to assign the maps we create using tmap to objects. For example, let’s assign the last map we created to a new object named traffic_bias_map_continuous:

# Creates map showing variation in "bias_index" across CO counties, and assigns 
# the map to object named "traffic_bias_map_continuous"
traffic_bias_map_continuous<-
tm_shape(county_shapefile_biasIndex)+
  tm_polygons(col="bias_index", 
              palette=my_colors, 
              textNA="No Data", 
              breaks=c(-10,-5, 0, 2, 4, 5))+
  tm_layout(frame=FALSE, 
            legend.outside=TRUE)

Now, we can bring up this map whenever we need it:

# Prints contents of "traffic_bias_map_continuous"
traffic_bias_map_continuous

6.3.4 Make a categorical map

So far, we have been mapping the bias_index variable, which is a continuous variable. This has the advantage of allowing us to visualize the full extent of variation in the bias_index variable. However, there are also other ways we might visualize the data. For example, we could transform bias_index from a continuous numeric variable into a categorical variable, and visualize this categorical variable on a map.

More specifically, let’s say we want to use a map to clearly distinguish the counties where racial bias in traffic stops appears to be a problem (where “bias_index”>0) and those counties in which it does NOT appear to be a problem (where “bias_index”<=0). On the one hand, this would throw out useful information on the variation of “bias_index”, but on the other hand, it could yield a more stark and focused map.

To build such a map, the first step is to create a new categorical variable, based on the continuous bias_index variable, within county_shapefile_biasIndex. In the code below, we take the existing spatial dataset assigned to the county_shapefile_biasIndex object, and then use the mutate function to create a new variable named “apparent_bias.” This new “apparent_bias” variable is set to “Apparent Bias” for counties where “bias_index”>0, and set to “No Apparent Bias” for all other counties (i.e. where the bias index is less than or equal to zero). This is accomplished using the ifelse function. The first argument to the ifelse function is a given condition (here bias_index>0). The second argument specifies the value that the new “apparent_bias” variable should take when that condition is true; the third argument specifies the value that the “apparent_bias” variable should take when that condition is false. After creating and defining this new variable, we assign the changes back to the county_shapefile_biasIndex object, which overwrites the previous version of the dataset.

# Takes the existing dataset assigned to the "county_shapefile_biasIndex" 
# object, and creates a new variable named "apparent_bias"; this variable takes 
# on the value "Apparent Bias" where the "bias_index" variable is >0 and 
# "No Apparent Bias" where it is less than or equal to zero; these changes 
# are then assigned back to the "county_shapefile_biasIndex" object
county_shapefile_biasIndex<-
  county_shapefile_biasIndex %>% 
            mutate(apparent_bias=ifelse(bias_index>0, 
                                        "Apparent Bias", 
                                        "No Apparent Bias"))
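As a quick, optional sanity check (not part of the original workflow), we can tabulate the new variable to see how many counties fall into each category; setting useNA="ifany" also counts counties with missing values:

```r
# Counts counties in each "apparent_bias" category, including NA values
table(county_shapefile_biasIndex$apparent_bias, useNA = "ifany")
```

If the counts look implausible (for example, all counties in one category), that is a signal to revisit the ifelse condition before mapping.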

Let’s take a look at what the new variable looks like:

# prints contents of "county_shapefile_biasIndex"
county_shapefile_biasIndex
## Simple feature collection with 64 features and 28 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -109.0602 ymin: 36.99245 xmax: -102.0415 ymax: 41.00344
## geographic CRS: NAD83
## First 10 features:
##        NAME                       geometry bias_index    apparent_bias STATEFP
## 1  Saguache MULTIPOLYGON (((-106.8714 3...  0.5591577    Apparent Bias      08
## 2  Sedgwick MULTIPOLYGON (((-102.6521 4...  3.0473002    Apparent Bias      08
## 3  Cheyenne MULTIPOLYGON (((-102.5769 3...  3.1472044    Apparent Bias      08
## 4    Custer MULTIPOLYGON (((-105.7969 3... -0.2021878 No Apparent Bias      08
## 5  La Plata MULTIPOLYGON (((-108.2952 3...  0.2272495    Apparent Bias      08
## 6  San Juan MULTIPOLYGON (((-107.9751 3...  0.8000000    Apparent Bias      08
## 7    Pitkin MULTIPOLYGON (((-106.9154 3...  0.3336883    Apparent Bias      08
## 8      Park MULTIPOLYGON (((-105.9751 3...  0.3973331    Apparent Bias      08
## 9   Alamosa MULTIPOLYGON (((-106.0393 3... -0.2620964 No Apparent Bias      08
## 10  Prowers MULTIPOLYGON (((-102.2111 3...  3.2319999    Apparent Bias      08
##    COUNTYFP COUNTYNS GEOID        NAMELSAD LSAD CLASSFP MTFCC CSAFP CBSAFP
## 1       109 00198170 08109 Saguache County   06      H1 G4020  <NA>   <NA>
## 2       115 00198173 08115 Sedgwick County   06      H1 G4020  <NA>   <NA>
## 3       017 00198124 08017 Cheyenne County   06      H1 G4020  <NA>   <NA>
## 4       027 00198129 08027   Custer County   06      H1 G4020  <NA>   <NA>
## 5       067 00198148 08067 La Plata County   06      H1 G4020  <NA>  20420
## 6       111 00198171 08111 San Juan County   06      H1 G4020  <NA>   <NA>
## 7       097 00198164 08097   Pitkin County   06      H1 G4020   233  24060
## 8       093 00198162 08093     Park County   06      H1 G4020   216  19740
## 9       003 00198117 08003  Alamosa County   06      H1 G4020  <NA>   <NA>
## 10      099 00198165 08099  Prowers County   06      H1 G4020  <NA>   <NA>
##    METDIVFP FUNCSTAT      ALAND   AWATER    INTPTLAT     INTPTLON
## 1      <NA>        A 8206547705  4454510 +38.0316514 -106.2346662
## 2      <NA>        A 1419419016  3530746 +40.8715679 -102.3553579
## 3      <NA>        A 4605713960  8166129 +38.8356456 -102.6017914
## 4      <NA>        A 1913031921  3364150 +38.1019955 -105.3735123
## 5      <NA>        A 4376255148 25642578 +37.2873673 -107.8397178
## 6      <NA>        A 1003660672  2035929 +37.7810492 -107.6702567
## 7      <NA>        A 2514104907  6472577 +39.2175376 -106.9161587
## 8      <NA>        A 5682182508 43519840 +39.1189141 -105.7176479
## 9      <NA>        A 1871465874  1847610 +37.5684423 -105.7880414
## 10     <NA>        A 4243429484 15345176 +37.9581814 -102.3921613
##        county_name   County black_stop_pct black_pop_pct black_stops
## 1  Saguache County Saguache      0.7296607     0.1705030          20
## 2  Sedgwick County Sedgwick      3.4120735     0.3647733          26
## 3  Cheyenne County Cheyenne      3.4358047     0.2886003          38
## 4    Custer County   Custer      0.8474576     1.0496454           1
## 5  La Plata County La Plata      0.6191950     0.3919455          70
## 6  San Juan County San Juan      0.8000000     0.0000000           1
## 7    Pitkin County   Pitkin      0.8213552     0.4876670           4
## 8      Park County     Park      0.7943403     0.3970072          64
## 9   Alamosa County  Alamosa      0.9602501     1.2223466          43
## 10  Prowers County  Prowers      3.7458295     0.5138297         247
##    total_stops total_pop total_black_pop_over17 total_pop_over17
## 1         2741      6108                      8             4692
## 2          762      2379                      7             1919
## 3         1106      1836                      4             1386
## 4          118      4255                     37             3525
## 5        11305     51334                    160            40822
## 6          125       699                      0              571
## 7          487     17148                     69            14149
## 8         8057     16206                     52            13098
## 9         4478     15445                    142            11617
## 10        6594     12551                     47             9147

Now that we have created this new categorical variable, let’s go ahead and display it on our map of Colorado counties. The code below looks very similar to the code used to map the original “bias_index” variable. The main difference is that instead of setting col="bias_index", we set the col argument to "apparent_bias" (i.e. col="apparent_bias"). Another difference worth pointing out is that we use a different, two-color scheme (since there are now only two categories to map); this color scheme is defined by the vector c("orangered1", "white"). Finally, in the previous map, the legend’s title was taken from the name of the column that was mapped; here, having this title would be redundant, so we remove the legend title using title="". We’ll assign this map to a new object named traffic_bias_map_categorical:

traffic_bias_map_categorical<-
  tm_shape(county_shapefile_biasIndex)+
      tm_polygons(col="apparent_bias", 
                  title="", 
                  palette=c("orangered1", "white"), 
                  textNA="No Data")+
      tm_layout(frame=FALSE, 
                legend.outside=TRUE)

If we print the contents of traffic_bias_map_categorical, we open a map that looks something like this:

traffic_bias_map_categorical

7. Refining and formatting the maps of the Colorado traffic-stop bias index

In the previous section, we created and refined two maps based on the index of racial bias in traffic stops we created earlier in the tutorial. The first map shows the spatial distribution of the continuous “bias_index” variable across counties.

It looked like this:

# prints "traffic_bias_map_continuous"
traffic_bias_map_continuous

The second map created a new categorical variable based on “bias_index”, and displayed these categories on a map.

It looked like this:

# prints "traffic_bias_map_categorical"
traffic_bias_map_categorical

In this section, we’ll continue to refine and customize these maps (for example, by adding titles, map credits, county labels, and labels and titles for the legend).

7.1 Refining the map of the continuous “bias_index” variable

Let’s start by fine-tuning the map of the continuous “bias_index” variable. As an exercise, read through the code below (without scrolling down to the map), and see if you can decipher what each line is doing, forming a mental picture of the in-progress map as you go. If you’re having trouble understanding something, look up the function’s documentation, which you can access by typing the name of the function, preceded by a question mark. For example, to access the documentation for the tm_polygons function, we can type ?tm_polygons, which will bring up the function’s documentation (a description of the function’s various arguments) in the “Help” tab on the bottom-right of the R Studio interface.

my_colors<-c("white", "peachpuff", "red1", "red4", "navy")

traffic_bias_map_continuous<-
tm_shape(county_shapefile_biasIndex)+
  tm_polygons(col="bias_index", 
              palette=my_colors,
              title="(Black % of Traffic Stops) - (Black % of Population)",
              textNA="No Data", 
              breaks=c(-10,-5, 0.01, 2, 4, 5))+
  tm_layout(frame=FALSE, 
            legend.outside=TRUE,
            legend.text.size=0.68,
            legend.title.size=0.75,
            title.size=0.75,
            title.fontface = 2,
            title="Traffic Stop Racial Bias Index",
            main.title="Racial Bias in Traffic Stops, by County (Colorado)",
            main.title.position=0.03,
            main.title.size=1,
            attr.outside=TRUE)+
  tm_credits("Map Author: NAME\nData Sources: 2010 Decennial Census via tidycensus,\nStanford Open Policing Project,\nColorado GeoLibrary ", 
            position=c(0.02,0.01),
            size=0.38)

Having read through this code, let’s see what the resulting map looks like:

# prints updated map assigned to "traffic_bias_map_continuous" object
traffic_bias_map_continuous

Now, let’s unpack the various components of the code.

  • The code begins by defining the my_colors vector; this is the same color vector we used in Section 6.3.3, and is reproduced for convenience here.
  • traffic_bias_map_continuous<- indicates that the map created through the code on the right side of the assignment operator will be assigned to the object named traffic_bias_map_continuous. This overwrites the previous version of the map assigned to this object.
  • The first function that is called is tm_shape; we pass the name of the object that contains the data we want to map to this function, which yields tm_shape(county_shapefile_biasIndex).
  • After specifying our dataset with tm_shape, we call the tm_polygons function. The tm_polygons function takes several arguments, beginning with col="bias_index" (which specifies the column within county_shapefile_biasIndex that we want to map), and palette=my_colors (which specifies that we want to use the colors from the my_colors vector to represent different interval ranges for the “bias_index” variable). Next, in title="(Black % of Traffic Stops) - (Black % of Population)", we define a subtitle for the legend, so that viewers of the map can discern the intuition behind the bias index that is being mapped. Finally, textNA="No Data" defines the legend label for “NA” values, and breaks=c(-10,-5, 0.01, 2, 4, 5) sets the data intervals, which are communicated in the map’s legend. Notice that instead of setting one of the break points at 0, this version of the map uses 0.01; that is because the lower bound of each data interval is inclusive, while the upper bound is exclusive (i.e. a hypothetical county with a bias index of exactly 4.00 would be assigned the color that corresponds to the 4.00 to 5.00 legend interval, not the one corresponding to 2.00 to 4.00). With a break at exactly 0, a county with a bias index of exactly zero would therefore be grouped with counties with positive values on the index, when it would make more sense to group it with counties with negative values; setting the break point slightly above 0 avoids this scenario.
  • The next tmap function that is called is the tm_layout function, which allows us to specify various parameters that shape the map’s layout. Recall from before that frame=FALSE removes the map’s bounding box (which surrounds the area displayed on the map), while legend.outside=TRUE places the legend outside the map’s domain, so that it doesn’t overlap with any of the counties that are displayed. Next, legend.text.size=0.68 sets the size for legend text elements (those that are not part of the legend’s title or subtitle), while legend.title.size=0.75 sets the size of the legend’s subtitle, and title.size=0.75 sets the size of the text in the legend’s main title. Note that when setting the size of text elements on your map, the best approach is usually trial and error; start with an arbitrary size, and then iterate from there. title="Traffic Stop Racial Bias Index" sets the text for the legend’s main title, and title.fontface = 2 sets the legend’s main title text in boldface. main.title="Racial Bias in Traffic Stops, by County (Colorado)" sets the map’s main title, main.title.position=0.03 sets the position/justification of the title, and main.title.size=1 sets the size of the main title’s text. Finally, attr.outside=TRUE specifies that any map elements other than the legend (such as the map credits) are to be placed outside the map boundary.
  • The tm_credits function is a tmap function which allows us to add a credits section to the map, to convey information about things such as the name of the map author and the sources of the data used to create the map. In the code above, the first argument to the tm_credits function is simply the text we would like to include in the credits; new lines are indicated by text that reads \n. Second, position=c(0.02,0.01) specifies the position of the credits section with respect to the map (the first element of the vector, 0.02, can be thought of as the x-coordinate, while the second element, 0.01, can be thought of as a y-coordinate); these coordinates place the credits below the map on the left-hand side. Finally, size=0.38 specifies the size of the text used in the credits section.

Notice that our map currently does not have labels for county names; these can easily be added using the tm_text function that is part of the tmap package. Instead of rewriting all of the code from above, we can simply append the tm_text function and its arguments to the map object we already created above (traffic_bias_map_continuous). The code below takes the existing traffic_bias_map_continuous object, which we updated above, and adds a line of code that reads tm_text("NAME", size=0.30, fontface=2). This labels the map’s polygons using information from the underlying dataset’s “NAME” field (which contains county names), using a text size of 0.3; the argument fontface=2 specifies that the county name labels are to be printed in boldface. No other changes or additions are made to traffic_bias_map_continuous. The labeled version of the map is assigned to a new object, named traffic_bias_map_continuous_labeled.

# Adds labels to "traffic_bias_map_continuous" and assigns new labeled map
# to new object named "traffic_bias_map_continuous_labeled"
traffic_bias_map_continuous_labeled<-
  traffic_bias_map_continuous+
    tm_text("NAME", size=0.30, fontface=2)

When we open traffic_bias_map_continuous_labeled, we can see the map with labels:

# prints contents of "traffic_bias_map_continuous_labeled"
traffic_bias_map_continuous_labeled

7.2 Refining the categorical map

The code below refines the categorical map we created in Section 6.3.4. The map is largely the same as the one created in that section, but adds and customizes the appearance of the map’s title using arguments to the tm_layout function that we discussed in Section 7.1, adds a map credits section that is identical to the one we used in 7.1, and labels the counties with tm_text as in Section 7.1. The map is assigned to the traffic_bias_map_categorical object, which overwrites the map we previously assigned to this object in Section 6.3.4.

traffic_bias_map_categorical<-
tm_shape(county_shapefile_biasIndex)+
  tm_polygons(col="apparent_bias", 
              title="", 
              palette=c("orangered1", "white"), 
              textNA="No Data")+
    tm_layout(frame=FALSE, 
              legend.outside=TRUE,
              main.title="Racial Bias in Traffic Stops, by County in Colorado\n(Based on Whether % of Black Motorists Stopped by Police > County's Adult Black Population %)",
              main.title.position=0.03,
              main.title.size=0.75,
              attr.outside=TRUE)+
  tm_credits("Map Author: NAME\nData Sources: 2010 Decennial Census via tidycensus,\nStanford Open Policing Project,\nColorado GeoLibrary ",
              position=c(0.02,0.01), # Specifies location of map credits
             size=0.38)+
  tm_text("NAME", size=0.30, fontface=2)

We can print the contents of traffic_bias_map_categorical to view the updated map in the “Plots” window:

# prints contents of "traffic_bias_map_categorical"
traffic_bias_map_categorical

7.3 Making a dynamic map of the bias index

Recall from Section 6.1 that it is possible to render both static maps and interactive maps using the tmap package. Thus far, our focus has been on creating static maps to display county-level variation in the “bias_index” variable (or the derivative “apparent_bias” variable). It’s worth briefly reminding ourselves that we can easily turn these static maps into dynamic maps by changing the tmap mode to “view.” For example, let’s say we want to display the map assigned to traffic_bias_map_continuous in a dynamic setting:

# Changes tmap mode to "view"
tmap_mode("view")
## tmap mode set to interactive viewing
# Prints "traffic_bias_map_continuous" in View mode
traffic_bias_map_continuous
## Credits not supported in view mode.
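When you are finished exploring the interactive map, you can return tmap to static plotting; a minimal sketch using tmap’s mode-switching function:

```r
# Switches tmap back to static plotting mode so that subsequent maps
# render in the "Plots" window rather than the interactive viewer
tmap_mode("plot")
```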

8. Exporting maps

Once you have made a map and are satisfied with its appearance, it is straightforward to export the map from R Studio to a local directory on your computer, where you can then embed your map in papers, projects, presentations, or websites.

The easiest way to export maps is by using the tmap_save function. This function allows users to specify the desired dimensions and resolution of the exported map, among other things. For our purposes, we’ll simply export the map assigned to traffic_bias_map_continuous_labeled using the default settings. Below, the first argument is the name of the map object we’d like to export; here, that is traffic_bias_map_continuous_labeled. The second argument is the desired file name and extension; it is possible to choose from a number of formats, such as PDF, JPEG, and PNG. Here, we’ll export the map as a PNG file.

tmap_save(traffic_bias_map_continuous_labeled, 
          "traffic_bias_map_continuous_labeled.png")
## Map saved to traffic_bias_map_continuous_labeled.png
## Resolution: 2448.943 by 1800.777 pixels
## Size: 8.163142 by 6.002591 inches (300 dpi)

Because we didn’t specify the path to a directory in the second argument, the map will be exported to our working directory.
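If the default dimensions don’t suit your needs, tmap_save also accepts sizing arguments; a sketch is below (the width, height, units, and dpi values, as well as the output filename, are illustrative placeholders, not part of the original workshop):

```r
# Exports the map at an explicit size and resolution;
# the dimensions below are illustrative placeholders
tmap_save(traffic_bias_map_continuous_labeled,
          "traffic_bias_map_custom.png",
          width=8, height=6, units="in", dpi=300)
```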

Note that if we’d like to export a dynamic/interactive map, we can use the same code as above; the only difference is that we would use an “html” extension. The interactive map would then be exported to our working directory as an html file, which can then easily be embedded on a website.
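For example, to export the interactive version of the continuous map (the filename here is just an illustration):

```r
# Saves the interactive map as an HTML file in the working directory,
# which can then be embedded on a website
tmap_save(traffic_bias_map_continuous, "traffic_bias_map_continuous.html")
```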

9. Summary scripts

This section summarizes the code we have written over the course of the tutorial to map county-level variation in possible anti-Black bias in Colorado’s 2010 traffic patrol stops. Section 9.1 provides the script to clean and process the original dataset published by the Stanford Open Policing Project, and get that data into a form that is suitable for mapping. Section 9.2 provides the script to create a map of the continuous “bias_index” variable, with counties labeled. Section 9.3 provides the script to create a categorical variable that indicates whether a county’s value for “bias_index” is greater than zero or less than/equal to zero, and then map this categorical variable (with counties labeled).

9.1 Summary script to prepare, clean, and process data for mapping

# Load the packages used throughout this script
library(tidyverse)
library(sf)
library(tmap)

# Read in Stanford police data for Colorado and assign to object 
# named "co_traffic_stops" 
co_traffic_stops<-read_csv("co_statewide_2020_04_01.csv")

# Create "Year" field based on existing "date" field
co_traffic_stops<-co_traffic_stops %>% 
                    mutate(Year=substr(date, 1, 4))

# Filter 2010 observations and assign to a new object named 
# "co_traffic_stops_2010"
co_traffic_stops_2010<-co_traffic_stops %>% filter(Year==2010)

# Compute county-level count of traffic stops by race and assign to object 
# named "co_county_summary"
co_county_summary<-co_traffic_stops_2010 %>% 
                    group_by(county_name) %>% 
                    count(subject_race) 

# Reshape the data so that the racial categories are transposed
# from rows into columns and assign the result to an object named
# "co_county_summary_wide"
co_county_summary_wide<-co_county_summary %>% 
                        pivot_wider(names_from=subject_race, values_from=n)


# Creates a new column named "total_stops" in "co_county_summary_wide" that
# contains information on the total number of stops for each county 
# (across all racial categories)
co_county_summary_wide<-co_county_summary_wide %>% 
                        rowwise() %>% 
                        mutate(total_stops=
                                 sum(c_across(where(is.integer)), na.rm=TRUE))

# Selects "county_name", "black", and "total_stops" variables from
# "co_county_summary_wide"; then renames the "black" variable to 
# "black_stops" for clarity; then removes counties that are named "NA" 
# due to an error in the dataset
co_county_black_stops<-co_county_summary_wide %>%
                        select(county_name, black, total_stops) %>% 
                        rename(black_stops=black) %>% 
                        filter(county_name!="NA")

# Read in the pre-prepared demographic data from the 2010 decennial 
# census and assign to an object named "co_counties_census_2010"
co_counties_census_2010<-read_csv("co_county_decennial_census.csv")

# Join "co_counties_census_2010" to "co_county_black_stops" and assign the result
# to an object named "co_counties_census_trafficstops"
co_counties_census_trafficstops<-full_join(co_county_black_stops, 
                                           co_counties_census_2010,
                                           by=c("county_name"="County"))

# Use the information in "co_counties_census_trafficstops" to define new 
# variables that will be used to compute the racial bias index: 
# "black_stop_pct" (the black percentage of overall traffic stops within
# a county) and "black_pop_pct" (the black percentage of the county's 
# over-17 population)

co_counties_census_trafficstops<-
  co_counties_census_trafficstops %>% 
    mutate(black_stop_pct=((black_stops/total_stops)*100),
          black_pop_pct=((total_black_pop_over17/total_pop_over17)*100))

# Calculate the bias index and include it as a new variable in 
# "co_counties_census_trafficstops"
co_counties_census_trafficstops<-co_counties_census_trafficstops %>% 
                                  mutate(bias_index=
                                           black_stop_pct-black_pop_pct)

# Reads in Colorado county shapefile and assigns the shapefile to a new object 
# named "co_counties_shapefile"
co_counties_shapefile<-st_read("tl_2019_08_county.shp")

# Join "co_counties_census_trafficstops" to "co_counties_shapefile" using 
# "GEOID" as the join field; assign the result to a new object named 
# "county_shapefile_biasIndex"
county_shapefile_biasIndex<-full_join(co_counties_shapefile, co_counties_census_trafficstops, by="GEOID")

9.2 Summary script for map of continuous “bias_index” variable

# creates color vector
my_colors<-c("white", "peachpuff", "red1", "red4", "navy") 

# make a map of the continuous "bias_index" variable
traffic_bias_map_continuous_labeled<-  
  tm_shape(county_shapefile_biasIndex)+ 
  tm_polygons(col="bias_index", 
              palette=my_colors, 
              title="(Black % of Traffic Stops) - (Black % of Population)",
              textNA="No Data", 
              n=5, 
              breaks=c(-10,-5, 0.01, 2, 4, 5))+ 
  tm_layout(frame=FALSE, 
            legend.outside=TRUE, 
            legend.text.size=0.68, 
            legend.title.size=0.75, 
            title="Traffic Stop Racial Bias Index", 
            title.size=0.75, 
            title.fontface = 2, 
            main.title="Racial Bias in Traffic Stops, by County (Colorado)",
            main.title.position=0.03,
            main.title.size=1, 
            attr.outside=TRUE)+ 
  tm_credits("Map Author: NAME\nData Sources: 2010 Decennial Census via tidycensus,\nStanford Open Policing Project,\nColorado GeoLibrary ", 
                    position=c(0.02,0.01), 
                    size=0.38)+
  tm_text("NAME", size=0.30, fontface=2)
# Prints map
traffic_bias_map_continuous_labeled

9.3 Summary script for categorical map

# Makes categorical variable
county_shapefile_biasIndex<-
  county_shapefile_biasIndex %>% 
            mutate(apparent_bias=ifelse(bias_index>0, "Apparent Bias", 
                                        "No Apparent Bias"))

# Makes categorical map and assigns to object named 
# "traffic_bias_map_categorical"
traffic_bias_map_categorical<-
  tm_shape(county_shapefile_biasIndex)+
    tm_polygons(col="apparent_bias", 
                title="", 
                palette=c("orangered1", "white"), 
                textNA="No Data")+
    tm_layout(frame=FALSE, 
              legend.outside=TRUE,
              main.title="Racial Bias in Traffic Stops, by County in Colorado\n(Based on Whether % of Black Motorists Stopped by Police > County's Adult Black Population %)",
              main.title.position=0.03,
              main.title.size=0.75,
              attr.outside=TRUE)+
  tm_credits("Map Author: NAME\nData Sources: 2010 Decennial Census via tidycensus,\nStanford Open Policing Project,\nColorado GeoLibrary ",  
              position=c(0.02,0.01), # Specifies location of map credits
             size=0.38)+
  tm_text("NAME", size=0.30, fontface=2)
# prints categorical map
traffic_bias_map_categorical

10. Reflections and Discussion Questions

Now that we’ve created our maps, it’s worthwhile to reflect on the process and its outcomes. Consider the following:

  • How might you change the maps we made? For example, would you change the color schemes, number of intervals, title, or any other features of their appearance? If so, why? See if you can change the code we developed to implement those changes.
  • What do you think are the possible shortcomings of the “bias_index” variable we created? Can you think of an alternative measure of racial bias in traffic stops that addresses these shortcomings? Could you create it with the data we already have? If not, what data would you need to collect?
  • What spatial patterns (if any) do you notice in how the “bias_index” is distributed across counties? Are there regions of the state where several counties with high values for the bias index are clustered? What might explain these patterns?
  • What implications might these maps have for public policy?

11. Conclusion and Further Reading

If you’re interested in exploring projects and resources that lie at the intersection of GIS, Black history, and the study of structural racism, ESRI’s Racial Equity Hub is a good place to start.

If you’re interested in exploring a specific example of a large-scale interdisciplinary project on systemic racism that incorporates spatial analysis, check out the Mapping Prejudice project from the University of Minnesota (Ehrman-Solberg et al, 2020). The data for that project is available here.

If you’d like to read the paper by Pierson et al (2020) for which the police stop data was originally collected, it is available here. The Stanford project site also has a useful tutorial that introduces the data and provides some guidance in analyzing it. The tutorial includes a discussion of more advanced ways to measure racial bias in stops than the one we used here, and a useful way to extend your knowledge might be to think about how to create and map those measures.

If you would like to further develop your spatial visualization and mapping skills, an excellent place to start is with the free and open-source book by Lovelace, Nowosad, and Muenchow (2021), entitled Geocomputation with R. The book is more than an introduction to making maps; it’s a comprehensive guide to spatial analysis and Geographic Information Systems more generally.

12. References

Ehrman-Solberg, K., P. Petersen, M. Mills, K. Delegard, and R. Mattke. 2020. Racial Covenants in Hennepin County. University of Minnesota Data Repository. https://doi.org/10.13020/a88t-yb14

Esri. n.d. GIS for Racial Equity: Racial Equity Hub. Accessed February 15, 2022. https://gis-for-racialequity.hub.arcgis.com/

Esri. n.d. Data Classification Methods. Accessed February 14, 2021. https://pro.arcgis.com/en/pro-app/2.7/help/mapping/layer-properties/data-classification-methods.htm

Frazier, M. “R Color Cheatsheet.” Accessed February 10, 2022. https://www.nceas.ucsb.edu/sites/default/files/2020-04/colorPaletteCheatsheet.pdf.

Lovelace, R, J. Nowosad, and J. Muenchow. 2021. Geocomputation With R. https://geocompr.robinlovelace.net/.

Pebesma, E. 2018. Simple Features for R: Standardized Support for Spatial Vector Data. The R Journal 10(1): 439-446. https://doi.org/10.32614/RJ-2018-009.

Pierson, E., C. Simoiu, J. Overgoor, S. Corbett-Davies, D. Jenson, A. Shoemaker, V. Ramachandran, P. Barghouty, C. Phillips, R. Shroff, and S. Goel. 2020. A large-scale analysis of racial disparities in police stops across the United States. Nature Human Behaviour 4: 736-745.

R For Social Scientists: Setup. 2018-2021. Data Carpentry. https://datacarpentry.org/r-socialsci/setup.html.

Stanford Open Policing Project. 2019. Open Policing Project Tutorial. Accessed February 15, 2022. https://openpolicing.stanford.edu/tutorials/.

Stelter, M., I. Essien, C. Sander, and J. Degner. 2021. Racial bias in police traffic stops: White residents’ county-level prejudice and stereotypes are related to disproportionate stopping of Black drivers. PsyArXiv Preprints. https://psyarxiv.com/djp8g/.

Tennekes, M. 2018. tmap: Thematic Maps in R. Journal of Statistical Software 84(6): 1-39. https://doi.org/10.18637/jss.v084.i06.

U.S. Census Bureau. 2019. Colorado Counties Shapefile. https://geo.colorado.edu/catalog/47540-5e712aeda3d91e0009f59fc7.

Walker, K. and M. Herman. 2021. “tidycensus: Load US Census Boundary and Attribute Data as tidyverse and sf-Read Data Frames.” R Package version 1.1. https://CRAN.R-project.org/package=tidycensus

Wickham, H., M. Averick, J. Bryan, W. Chang, L.D. McGowan, R. Francois, G. Grolemund et al. 2019. "Welcome to the tidyverse." Journal of Open Source Software 4(43): 1686. https://doi.org/10.21105/joss.01686.

Appendix: Using tidycensus to extract relevant census data

This section provides a script used to extract the census dataset that was read into R Studio in Section 5.1. To save time during a workshop, it is recommended to prepare the census dataset required to create the relevant index beforehand, and simply provide students with the relevant dataset. However, if you are looking for a way to extract the census dataset within R, the following script can be used as a guide.

To extract the census data required to create the bias index, load the tidyverse (which was used in the workshop tutorial) and the tidycensus package, which is an R package that allows users to pull Census Bureau data using the Census API (if tidycensus is not already installed, please install it using install.packages("tidycensus")).

# Loads libraries required to extract census data
library(tidycensus)
library(tidyverse)

Before extracting data with tidycensus, you must acquire a Census API key from the Census Bureau website; once you apply for a key on the website, your key will be immediately emailed to the address you provide. Enter your Census API key in R Studio with the following code, replacing the placeholder text with your key:

census_api_key("INSERT HERE")

In the workshop, we were working with 2010 police stops data, so it made sense to pull demographic data from the 2010 decennial census. The discussion below is therefore framed with respect to the 2010 decennial census; if you choose to use another census dataset to create the index, your code may look slightly different.

Step 1: Define your variables

First, we can generate a table that contains the various variables (and associated variable codes) for the 2010 decennial census by using the load_variables function. The arguments to the function below (2010, "sf1") indicate that we want to extract variable names and codes for Summary File 1 of the 2010 decennial census. We assign the table to an object named decennial_2010_variables, which allows us to easily view the table by using the View function, and refer back to it whenever needed.

# Variable list for 2010 Decennial
decennial_2010_variables<-load_variables(2010, "sf1")

Based on the information in decennial_2010_variables, we can identify the variable codes for the variables we want to extract. Then, we define a named vector, assigned to an object called my_vars, that maps descriptive names to variable codes; these descriptive names will be used as column names in the dataset returned by the census API call, while the variable codes will be used by tidycensus to populate the respective fields with the desired data.
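One way to locate candidate variable codes is to filter the variables table on its “concept” or “label” columns; a sketch is below (the search string is an illustrative example, not a code used in the workshop):

```r
# Searches the variable table for entries whose concept mentions
# "SEX BY AGE"; the search term is an illustrative example
decennial_2010_variables %>% 
  filter(str_detect(concept, "SEX BY AGE"))
```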

For the purpose of defining the “bias_index” variable, recall that the two key variables we need are the over-17 total population, and the over-17 Black population (counted at the county level). There is no separate category in the census dataset for these measures, so we must derive them based on the data that is available.

Given the available data, to calculate the total over-17 population, we must extract data for the male under-5 population, the male 5 to 9 population, the male 10 to 14 population, and the male 15 to 17 population, and analogous measures for the female population. Subtracting these values from the total overall population (among all age groups) will yield a value for the total over-17 population.

To calculate the Black over-17 population, we will extract a variable that defines the total Black population, and a series of variables that measure the Black population for different demographic (sex/age) combinations under 17 years old; subtracting the sum of the latter variables from the total Black population yields a measure of the Black over-17 population.

The code below extracts all the variables needed to carry out these calculations:

# Define and name variables for census API call

my_vars<-c(total_pop="P001001",
           totalpop_men_u5="P012003",
           totalpop_men_5to9="P012004",
           totalpop_men_10to14="P012005",
           totalpop_men_15to17="P012006",
           totalpop_women_u5="P012027",
           totalpop_women_5to9="P012028",
           totalpop_women_10to14="P012029",
           totalpop_women_15to17="P012030",
           black_totalpop="PCT012B001",
           black_men_u1="PCT012B003",
           black_men_1="PCT012B004",
           black_men_2="PCT012B005",
           black_men_3="PCT012B006",
           black_men_4="PCT012B007",
           black_men_5="PCT012B008",
           black_men_6="PCT012B009",
           black_men_7="PCT012B010",
           black_men_8="PCT012B011",
           black_men_9="PCT012B012",
           black_men_10="PCT012B013",
           black_men_11="PCT012B014",
           black_men_12="PCT012B015",
           black_men_13="PCT012B016",
           black_men_14="PCT012B017",
           black_men_15="PCT012B018",
           black_men_16="PCT012B019",
           black_men_17="PCT012B020",
           black_women_u1="PCT012B107",
           black_women_1="PCT012B108",
           black_women_2="PCT012B109",
           black_women_3="PCT012B110",
           black_women_4="PCT012B111",
           black_women_5="PCT012B112",
           black_women_6="PCT012B113",
           black_women_7="PCT012B114",
           black_women_8="PCT012B115",
           black_women_9="PCT012B116",
           black_women_10="PCT012B117",
           black_women_11="PCT012B118",
           black_women_12="PCT012B119",
           black_women_13="PCT012B120",
           black_women_14="PCT012B121",
           black_women_15="PCT012B122",
           black_women_16="PCT012B123",
           black_women_17="PCT012B124")

Step 2: Extract the variables using tidycensus

Now that we have a vector of the variables we want to extract (along with descriptive names for those variables), we will use the get_decennial function from tidycensus to extract these variables from the 2010 decennial census. Several arguments are passed to the get_decennial function below:

  • geography="county" specifies that we want the census data to be provided at the county level
  • variables=my_vars specifies the variables we want to extract, and the names they are to be given in the dataset; this information is contained in the my_vars vector defined above
  • state="CO" specifies the state for which we want to extract the data; this argument, together with the geography="county" argument, means that tidycensus will extract the specified data in my_vars at the county level for the state of Colorado.
  • survey="sf1" indicates which census dataset we would like to query; here sf1 (short for Summary File 1), indicates we are referring to the decennial census (as opposed, for example, to the American Community Survey)
  • output="wide" indicates that we want the dataset with the extracted variables to be in “wide” format, wherein each variable is assigned to its own column.
  • year=2010 indicates that we are interested in data from 2010. Combined with survey="sf1", this will extract the 2010 decennial census data.

Finally, we’ll assign the extracted dataset to an object named co_counties_race:

# Issue call to Census API
co_counties_race<-
get_decennial(
  geography="county", 
  variables=my_vars,
  state="CO",
  survey="sf1",
  output="wide",
  year=2010)          
## Getting data from the 2010 decennial Census
## Using Census Summary File 1

We can print the first few lines of the dataset to the console to view its structure, and ensure that everything looks in order:

# prints contents of "co_counties_race"
co_counties_race
## # A tibble: 64 × 48
##    GEOID NAME        total_pop totalpop_men_u5 totalpop_men_5t… totalpop_men_10…
##    <chr> <chr>           <dbl>           <dbl>            <dbl>            <dbl>
##  1 08023 Costilla C…      3524              99              102              107
##  2 08025 Crowley Co…      5823              88              110              136
##  3 08027 Custer Cou…      4255              62               95              113
##  4 08029 Delta Coun…     30952             887              913             1042
##  5 08031 Denver Cou…    600158           22252            18894            15319
##  6 08035 Douglas Co…    285465           11278            13587            12664
##  7 08033 Dolores Co…      2064              58               63               71
##  8 08049 Grand Coun…     14843             416              421              431
##  9 08039 Elbert Cou…     23086             594              790              976
## 10 08041 El Paso Co…    622263           23152            23050            23252
## # … with 54 more rows, and 42 more variables: totalpop_men_15to17 <dbl>,
## #   totalpop_women_u5 <dbl>, totalpop_women_5to9 <dbl>,
## #   totalpop_women_10to14 <dbl>, totalpop_women_15to17 <dbl>,
## #   black_totalpop <dbl>, black_men_u1 <dbl>, black_men_1 <dbl>,
## #   black_men_2 <dbl>, black_men_3 <dbl>, black_men_4 <dbl>, black_men_5 <dbl>,
## #   black_men_6 <dbl>, black_men_7 <dbl>, black_men_8 <dbl>, black_men_9 <dbl>,
## #   black_men_10 <dbl>, black_men_11 <dbl>, black_men_12 <dbl>, …

As always, it is also possible to view the dataset in the R Studio data viewer by running View(co_counties_race).

Step 3: Clean the tidycensus dataset

Having extracted the dataset, you may want to tidy it up depending on your needs and preferences. For example, the “NAME” field includes the name of the state, which is not really necessary here since there are only observations from Colorado in the dataset. The code below removes the state name from the NAME field, and updates the co_counties_race object with this change:

# Remove state name from name field
co_counties_race<-co_counties_race %>% 
                    separate(col=NAME, c("County", "x"), sep=",") %>% 
                     select(-x)     
# Prints contents of "co_counties_race"
co_counties_race
## # A tibble: 64 × 48
##    GEOID County      total_pop totalpop_men_u5 totalpop_men_5t… totalpop_men_10…
##    <chr> <chr>           <dbl>           <dbl>            <dbl>            <dbl>
##  1 08023 Costilla C…      3524              99              102              107
##  2 08025 Crowley Co…      5823              88              110              136
##  3 08027 Custer Cou…      4255              62               95              113
##  4 08029 Delta Coun…     30952             887              913             1042
##  5 08031 Denver Cou…    600158           22252            18894            15319
##  6 08035 Douglas Co…    285465           11278            13587            12664
##  7 08033 Dolores Co…      2064              58               63               71
##  8 08049 Grand Coun…     14843             416              421              431
##  9 08039 Elbert Cou…     23086             594              790              976
## 10 08041 El Paso Co…    622263           23152            23050            23252
## # … with 54 more rows, and 42 more variables: totalpop_men_15to17 <dbl>,
## #   totalpop_women_u5 <dbl>, totalpop_women_5to9 <dbl>,
## #   totalpop_women_10to14 <dbl>, totalpop_women_15to17 <dbl>,
## #   black_totalpop <dbl>, black_men_u1 <dbl>, black_men_1 <dbl>,
## #   black_men_2 <dbl>, black_men_3 <dbl>, black_men_4 <dbl>, black_men_5 <dbl>,
## #   black_men_6 <dbl>, black_men_7 <dbl>, black_men_8 <dbl>, black_men_9 <dbl>,
## #   black_men_10 <dbl>, black_men_11 <dbl>, black_men_12 <dbl>, …

Step 4: Define new variables

Now that we have a cleaned dataset with all our necessary variables, we can use these variables to generate the demographic variables needed to calculate the bias index. First, the code below defines a new variable, called “total_pop_over17”, that is calculated by subtracting the total population that is 17 and under from the total overall population:

# Create variable for total over-17 population
co_counties_race<-
  co_counties_race %>% 
     mutate(total_pop_over17=
              total_pop-totalpop_men_u5-totalpop_men_5to9-
              totalpop_men_10to14-totalpop_men_15to17-totalpop_women_u5-
              totalpop_women_5to9-totalpop_women_10to14-totalpop_women_15to17)

Then, we create a new variable named “total_black_pop_over17”, which is defined by subtracting the total Black population that is 17 and under from the total Black population:

# Create variable for total over-17 Black population
co_counties_race<-
  co_counties_race %>% 
  mutate(total_black_pop_over17=
          black_totalpop-black_men_u1-black_men_1-
          black_men_2-black_men_3-black_men_4-black_men_5-black_men_6-
          black_men_7-black_men_8-black_men_9-black_men_10-black_men_11-
          black_men_12-black_men_13-black_men_14-black_men_15-black_men_16-
          black_men_17-black_women_u1-black_women_1-black_women_2-black_women_3-
          black_women_4-black_women_5-black_women_6-black_women_7-black_women_8-
          black_women_9-black_women_10-black_women_11-black_women_12-
          black_women_13-black_women_14-black_women_15-black_women_16-
          black_women_17)
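As a quick sanity check (not part of the original workshop script), we can verify that the derived over-17 counts are non-negative and never exceed the corresponding totals:

```r
# Basic consistency checks on the derived variables; stops with an
# error if any derived count is negative or exceeds its total
stopifnot(all(co_counties_race$total_pop_over17 >= 0),
          all(co_counties_race$total_pop_over17 <=
                co_counties_race$total_pop),
          all(co_counties_race$total_black_pop_over17 >= 0),
          all(co_counties_race$total_black_pop_over17 <=
                co_counties_race$black_totalpop))
```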

Step 5: Finalize and export the dataset

Now that we have our two key variables defined, let’s clean up the dataset by removing the variables we no longer need, and only keeping the variables necessary to create the bias index. Below, we take the existing dataset assigned to co_counties_race, and select the “GEOID” and “County” variables (which serve as ID variables), “total_black_pop_over17” and “total_pop_over17” (which are used to compute the bias index), and “total_pop” (which is not necessary to create “bias_index”, but which could prove useful in exploring ways of defining a bias index other than the one implemented in the tutorial). The new dataset is assigned to an object named co_counties_census_2010:

# Clean data by selecting relevant variables for analysis, and assign the
# selection to a new object named "co_counties_census_2010"
co_counties_census_2010<-
  co_counties_race %>% 
    select(GEOID, County, total_pop, total_black_pop_over17, total_pop_over17)

Let’s view the dataset’s contents:

# prints contents of "co_counties_census_2010"
co_counties_census_2010
## # A tibble: 64 × 5
##    GEOID County          total_pop total_black_pop_over17 total_pop_over17
##    <chr> <chr>               <dbl>                  <dbl>            <dbl>
##  1 08023 Costilla County      3524                     18             2788
##  2 08025 Crowley County       5823                    556             5034
##  3 08027 Custer County        4255                     37             3525
##  4 08029 Delta County        30952                    139            24101
##  5 08031 Denver County      600158                  45338           471392
##  6 08035 Douglas County     285465                   2447           198453
##  7 08033 Dolores County       2064                      4             1602
##  8 08049 Grand County        14843                     43            11825
##  9 08039 Elbert County       23086                    122            17232
## 10 08041 El Paso County     622263                  27280           459587
## # … with 54 more rows

At this point, we have the census data used in the tutorial. This data was exported from R Studio, and provided to workshop participants as a CSV file that was included in the workshop materials.

To export the data, use the write_csv function; below, the first argument is the name of the object which contains the dataset to be exported, and the second argument is the desired file name. The data is exported to the current working directory, and can subsequently be opened as a CSV file in your spreadsheet software of choice.

# Exports the data
write_csv(co_counties_census_2010, "co_counties_census_2010.csv")